In this blog post, I would like to discuss academic teaching with you: why teaching is frowned upon by many researchers, and how teaching great courses can actually help them become better researchers.

Researchers at institutions with teaching assignments often find themselves faced with a dilemma when it comes to teaching: How can they justify spending precious time on teaching when it brings them no immediate benefit for their career (writing academic papers and getting funding)? Don’t get me wrong here; I assume many professors and other teaching staff know perfectly well that teaching is one of the most important duties of any university, if not the most important. The university serves as a sacred haven of knowledge where the half-gods of science descend into the earthly realms and bless the students with their knowledge and wisdom. Leaving biblical metaphors aside, it is also clear why so many professors frown upon the teaching responsibilities they have, and most probably wish they could assign this responsibility to someone else (which they sometimes do). But who can blame them? It is true that not many incentives exist for teaching staff to spend more time on teaching, or, more importantly, to teach better. Sure, some universities award teaching prizes for great lectures crafted with skill and foresight by highly motivated and talented teaching staff, but these universities are clearly the minority, and I argue that most of these great teachers would have taught just as well without the prizes: they are simply motivated of their own accord.

It is a curious fact that professors stand in front of hundreds of young students and try to shovel knowledge into their heads while typically having absolutely no formal qualification for doing so.

They are good at publishing papers and collecting funds; otherwise, they would probably not have gotten that position. But they don’t know how to explain their research field to newcomers. While school teachers spend many years of their training honing their didactic skills and practicing in actual classes, teaching staff at universities mostly study the academic content they are later expected to digest and pass on to their students. That being said, I wonder how academia can motivate researchers to teach better by directly rewarding good teaching. Could “teaching credits” be directly converted to citations, the most important external measurement of success of a researcher? Or even converted to direct funding? I am uncertain if such measures would actually help to alleviate the problem of unmotivated and bad teaching at universities. Maybe students should not expect good teaching after all, since they should be able to learn concepts without pre-digestion by a teacher anyway? Regardless of what the ideal teaching concepts in academia might look like, I am trying to make the point that making an effort to give good and inspiring lectures can actually help lecturers with their own academic career, regardless of extrinsic motivation by the academic system.

In part 1, we discussed the deficiencies of today’s academic institutions when it comes to teaching obligations. It is true that there is no obvious path out of this situation and changes will come slowly, as always in academia. However, if you are a teaching researcher, I would like to point you towards possible novel views that you can apply today which might help you to do a better job at teaching.

- Teaching can bring you new ideas that you had not before: Technical discussions with bright students can be very fruitful both for teachers and students. Every student brings a different background to the table and provides personal insight into the problem. While most questions by students might not cause immediate new ideas, we all know the “Mh… I have never thought of it like that before”-feeling that sometimes leads to a new way of investigating a problem.
- If you can’t explain it simply, you don’t understand it well enough. This famous quote, commonly attributed to Einstein, can serve as a pretty good rule of thumb for differentiating between teachers and great teachers. If you are not able to break a concept down to its core ideas, you know that you have not yet deeply understood the concept. Applying this rule to yourself as a teacher can help you detect deficiencies in your own understanding of the subject that you might otherwise have missed. True experts in their field and great teachers can explain a concept at multiple levels of prior knowledge while still being concise. They have truly mastered their field. For example, take a look at the following video:

- Teaching is a unique and fun experience. Teaching can be rewarding in its own way if you are the type for it. Standing in front of a class can be fun if you have motivated and bright students willing to delve into technical discussions with you.
- Teaching helps to repeat the essentials: There is a reason why you talk about topic X in your lectures, right? Every student in your course should be able to explain this topic to their colleagues after taking your course. And so should you be. Teaching helps to repeat the essentials of your subject in-depth which you might otherwise have forgotten.
- Good teaching draws talent. In my personal view, this is the most important aspect of doing great teaching. If you are able to give insightful lectures, you will note an increase in applications from graduate and post-graduate students. Your lecture serves as one of the most important advertisements for your academic chair, and it is one of the few opportunities for students to learn what you are actually doing. It is important to remember that if you cannot recruit great talent, your chair will not be able to uphold its current standards once all the current graduates have left, which will directly affect the quality of your research and your academic output in the long run.

Last but not least, I will leave you with the following beautiful quote by the famous Richard Feynman, who has an inspiring story to tell about teaching in Surely You’re Joking, Mr. Feynman!

I don’t believe I can really do without teaching. The reason is, I have to have something so that when I don’t have any ideas and I’m not getting anywhere I can say to myself, “At least I’m living; at least I’m doing something; I am making some contribution” — it’s just psychological. When I was at Princeton in the 1940s I could see what happened to those great minds at the Institute for Advanced Study, who had been specially selected for their tremendous brains and were now given this opportunity to sit in this lovely house by the woods there, with no classes to teach, with no obligations whatsoever. These poor bastards could now sit and think clearly all by themselves, OK? So they don’t get any ideas for a while: They have every opportunity to do something, and they are not getting any ideas. I believe that in a situation like this a kind of guilt or depression worms inside of you, and you begin to worry about not getting any ideas. And nothing happens. Still no ideas come. Nothing happens because there’s not enough real activity and challenge: You’re not in contact with the experimental guys. You don’t have to think how to answer questions from the students. Nothing! In any thinking process there are moments when everything is going good and you’ve got wonderful ideas. Teaching is an interruption, and so it’s the greatest pain in the neck in the world. And then there are the longer period of time when not much is coming to you. You’re not getting any ideas, and if you’re doing nothing at all, it drives you nuts! You can’t even say “I’m teaching my class.”

Thank you for reading! Let me know in the comments how you feel about teaching if you are a researcher yourself, or your perception of teaching staff if you are a student!

Today, we will discuss our most recent paper **HeatNet: Bridging the Day-Night Domain Gap in Semantic Segmentation with Thermal Images** (by Johan Vertens, Jannik Zürn, and Wolfram Burgard). This post serves as a quick and dirty introduction to the topic and to the work itself. For more details, please refer to the original publication, which is available here; the project website will be available soon at http://thermal.cs.uni-freiburg.de/.

Robust and accurate semantic segmentation of urban scenes is one of the enabling technologies for autonomous driving in complex and cluttered driving scenarios. Recent years have shown great progress in RGB image segmentation for autonomous driving, which was predominantly demonstrated in favorable daytime illumination conditions.

*Fig. 1: During nighttime, thermal images provide important additional data that is not present in the visible spectrum. Note the contrast between vegetation and sky/clouds and the bright spots on the left and right, indicating pedestrians.*

*Fig. 2: In daytime scenes, thermal images can also provide important information where the dynamic range of standard RGB cameras does not suffice (vegetation in front of the bright sky). Note the temperature gradient between car or truck and road.*

While the reported results demonstrate high accuracies on benchmark datasets, these models tend to generalize poorly to adverse weather conditions and low illumination levels present at nighttime. This constraint becomes especially apparent in rural areas where artificial lighting is weak or scarce. In autonomous driving, to ensure safety and situation awareness, robust perception in these conditions is a vital prerequisite.

In order to perform similarly well in challenging illumination conditions, it is beneficial for autonomous vehicles to leverage modalities complementary to RGB images. Encouraged by prior work in thermal image processing, we investigate leveraging thermal images for nighttime semantic segmentation of urban scenes. Thermal images contain accurate thermal radiation measurements with a high spatial density. Furthermore, thermal radiation is much less influenced by sunlight illumination changes and is less sensitive to adverse conditions. Existing RGB-thermal datasets for semantic image segmentation are not as large-scale as their RGB-only counterparts. Thus, models trained on such datasets generalize poorly to challenging real-world scenarios.

In contrast to previous works, we utilize a semantic segmentation network for RGB daytime images as a teacher model, trained in a supervised fashion on the Mapillary Vistas Dataset, to provide labels for the RGB daytime images in our dataset. We project the thermal images into the viewpoint of the RGB camera images using extrinsic and intrinsic camera parameters that we determine using our novel targetless camera calibration approach. Afterward, we can reuse labels from this teacher model to train a multimodal semantic segmentation network on our daytime RGB-thermal image pairs. While the thermal modality is mostly invariant to lighting changes, the RGB modality differs significantly between daytime and nighttime and thus exhibits a significant domain gap. In order to encourage day-night invariant segmentation of scenes, we simultaneously train a feature discriminator that aims at classifying features in the semantic segmentation network as belonging either to daytime or to nighttime images. This helps align the internal feature distributions of the multimodal segmentation network, enabling the network to perform similarly well for nighttime images as for daytime images. Furthermore, we propose a novel training schedule for our multimodal network that helps to align the feature representations between day and night. As thermal cameras are not yet available in most autonomous platforms, we further propose to distill the knowledge from the domain-adapted multimodal model back into a unimodal segmentation network that exclusively uses RGB images.
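To make the interplay of the two training signals more concrete, here is a minimal per-pixel sketch in plain Python. It is not the HeatNet implementation: the function names, the flipped-label form of the confusion loss, and the `confusion_weight` value are assumptions chosen for illustration, and the actual formulation should be taken from the publication.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's hard pseudo-label."""
    pseudo_label = max(range(len(teacher_logits)), key=lambda c: teacher_logits[c])
    return -math.log(softmax(student_logits)[pseudo_label])

def binary_cross_entropy(p, target, eps=1e-12):
    """BCE for a single probability p and a target in {0.0, 1.0}."""
    return -(target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps))

def discriminator_loss(p_night, is_night):
    """The discriminator learns to tell day (0) from night (1) features."""
    return binary_cross_entropy(p_night, 1.0 if is_night else 0.0)

def confusion_loss(p_night, is_night):
    """Assumed formulation: the segmentation network is rewarded when the
    discriminator is fooled, i.e. the same BCE with the domain label flipped."""
    return binary_cross_entropy(p_night, 0.0 if is_night else 1.0)

def segmentation_objective(student_logits, teacher_logits, p_night, is_night,
                           confusion_weight=0.1):
    """Teacher cross-entropy plus a weighted domain confusion term.
    confusion_weight is a hypothetical hyperparameter."""
    return (teacher_cross_entropy(student_logits, teacher_logits)
            + confusion_weight * confusion_loss(p_night, is_night))
```

In this sketch, the discriminator and the segmentation network would be updated in alternation: the former minimizes `discriminator_loss`, the latter minimizes `segmentation_objective`.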

*Fig. 3: Our proposed HeatNet architecture uses both RGB and thermal images and is trained to predict segmentation masks in daytime and nighttime domains. We train our model with daytime supervision from a pre-trained RGB teacher model and with optional nighttime supervision from a pre-trained thermal teacher model trained on exclusively thermal images. We simultaneously minimize the cross-entropy prediction loss to the teacher model prediction and minimize a domain confusion loss from a domain discriminator to reduce the domain gap between daytime and nighttime images.*

*Fig. 4: Our stereo RGB and thermal camera rig mounted on our data collection vehicle.*

To kindle research in the area of thermal image segmentation and to allow for credible quantitative evaluation, we create the large-scale dataset Freiburg Thermal. We make the dataset and the code publicly available at http://thermal.cs.uni-freiburg.de/. The Freiburg Thermal dataset was collected during 5 daytime and 3 nighttime data collection runs, spanning the seasons summer through winter. Overall, the dataset contains 12051 daytime and 8596 nighttime time-synchronized images recorded with a stereo RGB camera rig (FLIR Blackfly 23S3C) and a stereo thermal camera rig (FLIR ADK) mounted on the roof of our data collection vehicle. In addition to images, we recorded GPS/IMU data and LiDAR point clouds. The Freiburg Thermal dataset contains highly diverse driving scenarios including highways, densely populated urban areas, residential areas, and rural districts. We also provide a testing set comprising 32 daytime and 32 nighttime annotated images. Each image has pixelwise semantic labels for 13 different object classes. Annotations are provided for the following classes: Road, Sidewalk, Building, Curb, Fence, Pole/Signs, Vegetation, Terrain, Sky, Person/Rider, Car/Truck/Bus/Train, Bicycle/Motorcycle, and Background. We deliberately selected extremely challenging urban and rural scenes with many traffic participants and changing illumination conditions.

For our segmentation approach, it is important to perfectly align RGB and thermal images, as otherwise the RGB teacher model predictions would not be valid as labels for the thermal modality. Thus, in order to accurately carry out the camera calibration for the thermal camera, we propose a novel targetless calibration procedure. While previous works have leveraged different kinds of checkerboards or circle boards, our method does not require any pattern. Although these patterns can be produced and utilized easily for RGB cameras, it remains a challenge to create patterns that are robustly visible in both RGB and thermal images. In general, the infrared and RGB modalities capture different information. However, we note that the edges of most common objects in urban scenes are easily observable in both modalities. Thus, in our approach, we minimize the pixel-wise distance between such edges. In the case of aligning two monocular cameras, targetless calibration without any prior information results in ambiguities in the estimation of the intrinsic camera parameters. We therefore utilize our pre-calibrated RGB stereo rig to provide the missing sense of scale. Due to the targetless nature of our approach, our thermal camera calibration method can be easily deployed in an online calibration scenario.
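The idea of minimizing the pixel-wise distance between edges can be illustrated with a heavily simplified toy example. The sketch below is an assumption-laden stand-in, not our calibration code: it replaces the optimization over camera parameters with a brute-force search over integer pixel shifts, and the names `edge_alignment_cost` and `best_shift` are hypothetical.

```python
import math

def edge_alignment_cost(thermal_edges, rgb_edges, shift=(0, 0)):
    """Mean distance from each shifted thermal edge pixel to its nearest RGB edge pixel."""
    dx, dy = shift
    total = 0.0
    for (x, y) in thermal_edges:
        total += min(math.hypot(x + dx - u, y + dy - v) for (u, v) in rgb_edges)
    return total / len(thermal_edges)

def best_shift(thermal_edges, rgb_edges, search_radius=3):
    """Brute-force search over integer shifts (a toy substitute for optimizing
    the intrinsic and extrinsic camera parameters)."""
    candidates = [(dx, dy)
                  for dx in range(-search_radius, search_radius + 1)
                  for dy in range(-search_radius, search_radius + 1)]
    return min(candidates, key=lambda s: edge_alignment_cost(thermal_edges, rgb_edges, s))

# Toy data: an L-shaped edge in the RGB image and the same edge displaced in the thermal image.
rgb_edge_pixels = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
thermal_edge_pixels = [(u - 2, v + 1) for (u, v) in rgb_edge_pixels]
recovered_shift = best_shift(thermal_edge_pixels, rgb_edge_pixels)
```

Here `recovered_shift` is the displacement that moves the thermal edge pixels back onto the RGB edge pixels. A real implementation would typically use a distance transform of the edge image and a continuous optimizer instead of an exhaustive search.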

We report the performance of HeatNet trained on Freiburg Thermal and tested on Freiburg Thermal, MF, and the BDD (Berkeley Deep Drive) night test split. We observe that our RGB teacher model, which is trained on the Vistas dataset, has a high mIoU score of 69.4 in the day domain and an expectedly low score of 35.7 in the night domain, as the network is neither trained on nor adapted to the night domain.

Our thermal teacher model MN achieves an mIoU score of 57.0, which shows that the domain gap is much smaller for the thermal modality than for RGB. Our final RGB-T HeatNet model achieves the overall best score of 64.9 on our test set. Furthermore, the RGB-only HeatNet reaches a score comparable to that of our RGB-T variant, demonstrating the effectiveness of our distillation approach, which leverages the thermal images as a bridge modality.
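For readers unfamiliar with the metric, the mean intersection-over-union (mIoU) averages the per-class ratio of correctly predicted pixels to the union of predicted and labeled pixels. The snippet below is a minimal sketch of that definition on flat lists of class indices; it returns a value in [0, 1], whereas the scores above are given in percent, and the exact evaluation protocol (for example, how classes are accumulated over the test set) is not shown here.

```python
def mean_iou(predictions, labels, num_classes):
    """Mean IoU over all classes that occur in the predictions or the labels."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predictions, labels) if p == c and t == c)
        union = sum(1 for p, t in zip(predictions, labels) if p == c or t == c)
        if union > 0:  # skip classes that are absent from both
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Toy example with two classes and four pixels.
example_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.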

*Fig. 5: Our stereo RGB and thermal camera rig mounted on our data collection vehicle.*

We deploy the same distilled RGB network to publish results on the night BDD split. It can be observed that our method boosts mIoU by 50%. In order to compare the performance of our network with the recent RGB-T semantic segmentation approaches MFNet and RTFNet-50, we also fine-tune our model on the 784-image MF training set and report scores on the corresponding test set. For evaluation, we select all classes that are compatible between MF and Freiburg Thermal, which are the classes Car, Person, and Bike. We train our method only with labels provided by the teacher model MD, without requiring any nighttime labels or any labels from MF. Thus, it is expected that MFNet and RTFNet outperform HeatNet, as they are trained in a supervised fashion. However, it can be observed that HeatNet achieves numbers comparable to those of MFNet.

We further evaluate the generalization properties of the models trained on MF and tested on our FR-T dataset. We observe that the model performance deteriorates when evaluating MFNet or RTFNet on our FR-T dataset. We conclude that the diversity and complexity of the MF dataset do not suffice to train robust and accurate models for daytime or nighttime semantic segmentation of urban scenes.

*Fig. 6: Qualitative semantic segmentation results of our model variants. We compare segmentation masks of our RGB-only teacher model, HeatNet RGB-only, and HeatNet RGB-T to ground truth. In the first two rows, we show segmentation masks obtained on the Freiburg Thermal dataset. The bottom row illustrates results obtained on the RGB-only BDD dataset. The multimodal approaches cannot be evaluated on BDD and the corresponding images are left blank.*

In this work, we presented a novel and robust approach for daytime and nighttime semantic segmentation of urban scenes by leveraging both RGB and thermal images. We showed that our HeatNet approach avoids expensive and cumbersome annotation of nighttime images by learning from a pre-trained RGB-only teacher model and by adapting to the nighttime domain. We further proposed a novel training initialization scheme by first pre-training our model with a daytime RGB-only teacher model and a nighttime thermal-only teacher model and subsequently fine-tuning the model with a domain confusion loss. We furthermore introduced a first-of-its-kind large-scale RGB-T semantic segmentation dataset, including a novel targetless thermal camera calibration method based on image gradient alignment maximization. We presented comprehensive quantitative and qualitative evaluations on multiple datasets and demonstrated the benefit of the complementary thermal modality for semantic segmentation and for learning more robust RGB-only nighttime models.

This post is more technical than my usual posts. If you have any questions about the research, please post questions in the comments. Thanks!

Recent advances in robotics and machine learning have enabled the deployment of autonomous robots in challenging outdoor environments for complex tasks such as autonomous driving, last-mile delivery, and patrolling. Robots operating in these environments encounter a wide range of terrains from paved roads and cobblestones to unstructured dirt roads and grass. It is essential for them to be able to reliably classify and characterize these terrains for safe and efficient navigation. This is an *extremely challenging problem* as the visual appearance of outdoor terrain drastically changes over the course of days and seasons, with variations in lighting due to weather, precipitation, artificial light sources, dirt or snow on the ground. Therefore, robots should be able to actively perceive the terrains and adapt their navigation strategy as solely relying on pre-existing maps is insufficient.

Most state-of-the-art learning methods require a significant amount of data samples which are often arduous to obtain in supervised learning settings where labels have to be manually assigned to data samples. Moreover, these models tend to degrade in performance once presented with data sampled from a distribution that is not present in the training data. In order to perform well on data from a new distribution, they have to be retrained after repeated manual labeling which in general is unsustainable for the widespread deployment of robots. Self-supervised learning allows the training data to be labeled automatically by exploiting the correlations between different input signals thereby reducing the amount of manual labeling work by a large margin.

Furthermore, unsupervised audio classification eliminates the need to manually label audio samples. We take a step towards lifelong learning for visual terrain classification by leveraging the fact that the distribution of terrain sounds does not depend on the visual appearance of the terrain. This enables us to employ our trained audio terrain classification model in previously unseen visual perceptual conditions to automatically label patches of terrain in images, in a completely self-supervised manner. The visual classification model can then be fine-tuned on the new training samples by leveraging transfer learning to adapt to the new appearance conditions.

*Fig. 1: Our self-supervised approach enables a robot to classify urban terrains without any manual labeling using an on-board camera and a microphone. Our proposed unsupervised audio classifier automatically labels visual terrain patches by projecting the traversed tracks into camera images. The resulting sparsely labeled images are used to train a semantic segmentation network for visually classifying new camera images in a pixel-wise manner.*

In our work, we present a novel self-supervised approach to visual terrain classification by exploiting the supervisory signal from an unsupervised proprioceptive terrain classifier utilizing vehicle-terrain interaction sounds. Fig. 1 illustrates our approach, where our robot equipped with a camera and a microphone traverses different terrains and captures both sensor streams along its trajectory. The poses of the robot recorded along the trajectory enable us to associate the visual features of a patch of ground that is initially in front of the robot with its corresponding auditory features when that patch of ground is traversed by the robot. We split the audio stream into small snippets and embed them into an embedding space using metric learning. To this end, we propose a novel triplet sampling method based on the visual features of the respective terrain patches. This enables the usage of triplet loss formulations for metric learning without requiring ground truth labels. We obtain the aforementioned visual features from an off-the-shelf image classification model pre-trained on the ImageNet dataset. To the best of our knowledge, our work is the first to exploit embeddings from one modality to form triplets for learning an embedding space for samples from an extremely different modality. We interpret the resulting clusters formed by the audio embeddings as labels for training a weakly-supervised visual semantic terrain segmentation model. We then employ this model for pixel-wise classification of terrain that is in front of the robot and use this information to build semantic terrain maps of the robot environment.

*Fig. 2: The five terrain types along with a birds-eye-view image and the corresponding spectrogram of the vehicle-terrain interaction sound from the five different terrain classes*

In this section, we detail our proposed self-supervised terrain classification framework. Fig. 3 visualizes the overall information flow in our system. While acquiring the images and audio data, we tag each sample with the robot pose obtained using our SLAM system. We then project the camera images into a birds-eye-view perspective and project the path traversed by the robot in terms of its footprint into this viewpoint. We transform the audio clips into a spectrogram representation and embed them into an embedding space using our proposed Siamese Encoder with Reconstruction loss on audio triplets that uses features in the visual domain for triplet forming. Subsequently, we cluster the embeddings and use the cluster indices to automatically label the corresponding robot path segments in the birds-eye-view images. The resulting labeled images serve as weakly labeled training data for the semantic segmentation network. **Note that the entire approach is executed completely in an unsupervised manner**. The cluster indices can be used to indicate terrain labels such as Asphalt and Grass or in terms of terrain properties.
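As a small illustration of the spectrogram step in this pipeline, the following sketch computes a magnitude spectrogram with a Hann window and a naive discrete Fourier transform in plain Python. The window size, hop length, and function name are assumptions for illustration only; a practical implementation would use an FFT routine from a numerical library.

```python
import cmath
import math

def spectrogram(signal, window_size, hop):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per time frame."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        window = signal[start:start + window_size]
        # Hann window to reduce spectral leakage at the frame borders.
        tapered = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (window_size - 1)))
                   for n, s in enumerate(window)]
        magnitudes = []
        for k in range(window_size // 2 + 1):  # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n, x in enumerate(tapered))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

# Toy clip: a pure tone that completes 4 cycles per 32-sample window.
tone = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(64)]
tone_frames = spectrogram(tone, window_size=32, hop=16)
```

For the toy tone, every frame has its strongest response in frequency bin 4, which is what one would expect from the signal's construction.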

*Fig. 3: Overview of our proposed self-supervised terrain classification framework. The upper part of the figure illustrates our novel Siamese Encoder with Reconstruction loss (SE-R), while the lower part illustrates how the labels obtained from the SE-R are used to automatically annotate data for training the semantic segmentation network. The camera images are first projected into a birds-eye-view perspective of the scene and the trajectory of the robot is projected into this viewpoint. In our SE-R approach, using both the audio clips from the recorded terrain traversal and the corresponding patches of terrain recorded with a camera, we embed each clip of the audio stream into an embedding space that is highly discriminative in terms of the underlying terrain class. This is performed by forming triplets of audio samples using the visual similarity of the corresponding patches of ground obtained with a pre-trained feature extractor. We then cluster the resulting audio embeddings and use the cluster indices as labels for self-supervised labeling. The resulting labeled images serve as a weakly labeled training dataset for a semantic segmentation network for pixel-wise terrain classification.*

We will now briefly discuss the major steps in the processing pipeline.

We record the stream of monocular camera images from an on-board camera and the corresponding audio stream of the vehicle-terrain interaction sounds from a microphone mounted near the wheel of our robot. We project the robot trajectory into the image coordinates using the robot poses obtained using our SLAM system. We additionally perform perspective warping of the camera images in order to obtain a birds-eye view representation.
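Under the assumption of a locally flat ground plane, such a perspective warp can be expressed as a 3x3 homography applied to points in homogeneous coordinates. The sketch below shows only this point mapping; the matrix `H_example` is a made-up value, and how the warp is obtained from the camera calibration and robot poses is not covered here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates

def warp_path(H, path):
    """Apply the same homography to every point of a path."""
    return [apply_homography(H, p) for p in path]

# Hypothetical homography with a perspective term in the last row.
H_example = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 1.0, 1.0]]
warped_path = warp_path(H_example, [(2.0, 1.0), (4.0, 3.0)])
```

The same mapping, applied to every pixel with interpolation, yields the warped image; applied to the footprint of the robot, it yields the path overlay in the birds-eye view.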

Each terrain patch that the robot traverses is represented by two modalities: **sound** and **vision**. We obtain the visual representation of a terrain patch from a distance using an on-board camera, while we record the vehicle-terrain interaction sounds by traversing the corresponding terrain patch. For our unsupervised acoustic feature learning approach, we exploit the highly discriminative visual embeddings of terrain patch images obtained using a CNN pre-trained on the ImageNet dataset to form triplets of audio samples. Triplet losses have been proposed to form such discriminative clusters of embeddings. We argue that the relative position of the terrain patch image embeddings in embedding space serves as a good approximation for the ground truth labels that have previously been relied on for triplet forming. We form triplets of audio clips using this heuristic. Finally, we train our Siamese Encoder with Reconstruction loss in order to embed these audio clips into a highly discriminative audio embedding space.

*Fig. 4: Two-dimensional t-SNE visualizations of the audio samples embedded with our SE-R approach after 0, 10, 30, and 90 epochs of training. The color of the points indicates the corresponding ground truth class. We observe that clusters of embeddings become clearly separable as the training progresses and that they highly correlate with the ground truth terrain class.*

The triplet loss enforces that the embeddings of samples with the same label are pulled together in embedding space while embeddings of samples with different labels are pushed away from each other. As the ground truth labels of the audio samples are not available to form triplets, we argue that an unsupervised heuristic can serve as a substitute signal for the ground truth labels for triplet creation: the local neighborhood of the terrain image patch embeddings. We obtain rectangular patches of terrain by selecting segments of pixels along the robot path. The closest neighbor in the embedding space has a high likelihood of belonging to the same ground truth class as the anchor sample. Therefore, for sampling triplets, we select the sample with the smallest Euclidean distance in the visual embedding space as a positive sample. We then select negative samples by randomly selecting samples that are in a different cluster in visual embedding space than the anchor sample. Although it cannot always be guaranteed that the negative sample does not have the same ground truth class, it has a high likelihood of belonging to a different class, which we observe in our experiments. Likewise, we argue that visually similar terrain patches have a high likelihood of belonging to the same class. This means that in practice a fraction of the generated triplets are not correctly formed. However, we empirically find that it is sufficient if the majority of triplets have correct class assignments, as they outweigh the incorrectly defined triplets.
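The sampling heuristic can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions rather than our training code: the visual cluster assignments are taken as given inputs, the margin value is arbitrary, and the hinge-style triplet loss on Euclidean distances is one common variant among several.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sample_triplet(anchor, visual_embeddings, visual_clusters, rng=random):
    """Pick (anchor, positive, negative) indices using visual features only.

    positive: nearest neighbor of the anchor in visual embedding space
    negative: random sample from a different visual cluster than the anchor
    """
    others = [i for i in range(len(visual_embeddings)) if i != anchor]
    positive = min(others,
                   key=lambda i: euclidean(visual_embeddings[anchor], visual_embeddings[i]))
    negatives = [i for i in others if visual_clusters[i] != visual_clusters[anchor]]
    negative = rng.choice(negatives)  # raises IndexError if only one cluster exists
    return anchor, positive, negative

def triplet_loss(audio_anchor, audio_positive, audio_negative, margin=1.0):
    """Hinge-style triplet loss evaluated on the audio embeddings."""
    return max(0.0, euclidean(audio_anchor, audio_positive)
               - euclidean(audio_anchor, audio_negative) + margin)

# Toy visual embeddings forming two well-separated groups.
toy_visual = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
toy_clusters = [0, 0, 1, 1]
toy_triplet = sample_triplet(0, toy_visual, toy_clusters)
```

Note that the indices are chosen from the visual embeddings, while the loss is computed on the audio embeddings of the same terrain patches. This is what lets the visual modality stand in for the missing ground truth labels.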

Finally, we perform k-means clustering of the embeddings to separate the samples into K clusters, corresponding to the K terrain classes present in the dataset. Our approach only requires us to set the number of terrain classes that are present and assign terrain class names to the cluster indices.
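For completeness, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The random initialization, the fixed iteration count, and the `seed` parameter are assumptions for illustration; in practice one would typically rely on a library implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy embeddings with two obvious groups.
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels, toy_centroids = kmeans(toy_embeddings, k=2)
```

The cluster indices in `toy_labels` carry no semantic meaning by themselves, which is why terrain class names still have to be assigned to the indices afterwards.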

We use the resulting weakly self-labeled birds-eye-view scenes to train a semantic segmentation network in a self-supervised manner. A self-supervisory signal can be obtained for every image pixel that contains a part of the robot path for which the label is known from the unsupervised audio clustering. Note that the segmentation masks for the traversed terrain types are incomplete, as the robot cannot be expected to traverse every physical location of terrain in the view to generate complete segmentation masks. We alleviate this challenge by considering all the pixels in camera images that do not contain the robot path as a background class that does not contribute to the segmentation loss. We deal with the class imbalance in the training set by weighting each class proportionally to its log frequency in the training data set.
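The effect of excluding background pixels from the loss can be sketched as a masked, class-weighted cross-entropy. In this hedged example, the `BACKGROUND` marker and the function name are hypothetical, and the class weights are simply passed in as numbers rather than derived from the class frequencies.

```python
import math

BACKGROUND = -1  # hypothetical marker for pixels without a path label

def masked_cross_entropy(pixel_probs, pixel_labels, class_weights):
    """Weighted cross-entropy over labeled pixels only.

    pixel_probs:   per pixel, a list of predicted class probabilities
    pixel_labels:  per pixel, a class index or BACKGROUND
    class_weights: one weight per class (assumed to be given)
    """
    total, weight_sum = 0.0, 0.0
    for probs, label in zip(pixel_probs, pixel_labels):
        if label == BACKGROUND:
            continue  # unlabeled pixels do not contribute to the loss
        weight = class_weights[label]
        total += -weight * math.log(probs[label])
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0

# Toy example: the middle pixel is unlabeled and therefore ignored.
toy_probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
toy_pixel_labels = [0, BACKGROUND, 1]
toy_loss = masked_cross_entropy(toy_probs, toy_pixel_labels, class_weights=[1.0, 1.0])
```

Because the unlabeled pixel is skipped, changing its predicted probabilities leaves `toy_loss` unchanged.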

We will briefly discuss some of the results reported in the original publication.

Fig. 6 illustrates some qualitative terrain classification results for a small clip in the dataset. We observe that a majority of pixels in each frame are assigned the correct labels. Some errors occur for terrains that are partially covered with objects (bikes in this scene) or have non-favorable lighting conditions.

*Fig. 6: Qualitative terrain classification results for a small clip in the dataset.*

For more qualitative and quantitative results, please refer to the original publication.

One of the major advantages of our self-supervised approach is that new labels on previously unseen terrains can easily be generated by the robot automatically. While the terrain traversal sounds do not substantially vary with the weather conditions other than rain and winds, the visual appearance of terrain can vary depending on several factors including time of day, season or cloudiness. We record data at dusk with low light conditions and artificial lighting resulting in a variation in terrain hues and substantial motion blur. We qualitatively compare the terrain classification results for a model trained exclusively on the Freiburg Terrains dataset and a model trained jointly on the Freiburg Terrains dataset as well as on the new low light dataset. Qualitative results from this experiment are shown in Fig. 7.

*Fig. 7: Qualitative results on a new low light dataset that was captured at dusk that has a considerable amount of motion blur, color noise, and artificial lighting. We show a comparison between the terrain classification model without and with fine-tuning on training data created using our self-supervised approach.*

We finally demonstrate the utility of our proposed self-supervised semantic segmentation framework for building semantic terrain maps of the environment. To build such a map, we use the poses of the robot that we obtain using our SLAM system and the terrain predictions of the birds-eye-view camera images. We project each image into the correct location in a global map using the 3-D camera pose and we use no additional image blending or image alignment optimization. For each single birds-eye-view image, we generate pixel-wise terrain classification predictions using our self-supervised semantic segmentation model. We then project these segmentation mask predictions into their corresponding locations in the global semantic terrain map, similar to the procedure that we employ for the birds-eye-view images. When there are predictions of a terrain location from multiple views, we choose the class with the highest prediction count for each pixel in the map. We also experimented with fusing the predictions from multiple views using Bayesian fusion which yields similar results. Fig. 8 shows how a semantic terrain map can be built from single camera images and the corresponding semantic terrain predictions of our approach. It can be observed that our self-supervised terrain segmentation model yields predictions that are for the most part globally consistent.

*Fig. 8: Tiled birds-eye-view images and the corresponding semantic terrain maps built from the predictions of our self-supervised semantic terrain segmentation model. We use the SLAM poses of the camera to obtain the 6-D camera poses for each frame.*
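The per-pixel vote over multiple views can be sketched as follows. This assumes the predictions have already been projected into a common map grid; the function name and the `-1` marker for uncovered cells are hypothetical.

```python
import numpy as np


def fuse_by_majority_vote(votes, num_classes):
    """Fuse per-view class predictions into one map.

    votes: (V, H, W) integer class ids, one layer per view, with -1 where a
           view does not cover the map cell.
    Returns an (H, W) map holding the most frequent class per cell and -1
    where no view covers the cell. Ties go to the lowest class id.
    """
    counts = np.stack([(votes == c).sum(axis=0) for c in range(num_classes)], axis=-1)
    fused = counts.argmax(axis=-1)
    fused[counts.sum(axis=-1) == 0] = -1
    return fused
```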

In this work, we proposed a self-supervised terrain classification framework that exploits the training signal from an unsupervised proprioceptive terrain classifier to learn an exteroceptive classifier for pixel-wise semantic terrain segmentation. We presented a novel heuristic for triplet sampling in metric learning that leverages a complementary modality as opposed to the typical strategy that requires ground truth labels. We employed this proposed heuristic for unsupervised clustering of vehicle-terrain interaction sound embeddings and subsequently used the resulting clusters formed by the audio embeddings for self-supervised labeling of terrain patches in images. We then trained a semantic terrain segmentation network from these weak labels for dense pixel-wise classification of terrains that are in front of the robot.

Thanks for reading!

The machine learning library TensorFlow has had a long history of releases starting from the initial open-source release from the Google Brain team back in November 2015. Initially developed internally under the name DistBelief, TensorFlow quickly rose to become the most widely used machine learning library today. And not without reason.

*GitHub repository stars over time for the most widely used machine learning libraries*

Before we discuss the most important changes for TensorFlow 2.0, let us quickly recap some of the essential aspects of TensorFlow 1.XX:

Python was the first client language supported by TensorFlow and currently supports the most features within the TensorFlow ecosystem. Nowadays, TensorFlow is available in a multitude of programming languages. The TensorFlow core is written in pure C++ for better performance and is exposed via a C API. Apart from the bindings for Python 2.7/3.4–3.7, TensorFlow also offers support for JavaScript (TensorFlow.js), Rust and R. In particular, the syntactically simple Python API, compared to the brittle explicitness of C/C++, allowed TensorFlow to quickly overtake the Caffe machine learning library, an early competitor.

From the start, the core of TensorFlow has been the so-called Computation Graph. In this graph model, each operation (Add, Multiply, Subtract, Logarithmize, Matrix-Vector Algebra, Complex functions, broadcasting, …) and also Variables/Constants are defined by a node in a directed graph. The directed edges of the Graph connect nodes to each other and define in which direction information/data flows from one node to the next. There are Input-Nodes where information is fed into the Computation Graph from outside, and Output-Nodes that output the processed data.

After the Graph has been defined, it can be executed on data that is fed into the Graph. Thus, the data *flows* through the graph, changes its content and shape, and is transformed into the output of the Graph. The data can usually be expressed as a multidimensional array, or Tensor, thus the name TensorFlow.
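The define-then-run idea can be illustrated without TensorFlow at all. The following toy sketch in plain Python is not TensorFlow's API; `Node`, `placeholder`, `constant` and `run` are hypothetical names that only mimic the concept of building a graph first and feeding data through it later.

```python
class Node:
    """A node in a toy dataflow graph; nothing is computed at construction time."""

    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __add__(self, other):
        return Node(lambda a, b: a + b, (self, other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, (self, other))


def placeholder(name):
    """Input node: its value is supplied from outside when the graph is run."""
    return Node(None, name=name)


def constant(value):
    return Node(lambda: value)


def run(node, feed):
    """Evaluate `node` by recursively evaluating its inputs (the 'session' step)."""
    if node.op is None:
        return feed[node.name]
    return node.op(*(run(i, feed) for i in node.inputs))


x = placeholder("x")
y = x * constant(2.0) + constant(1.0)   # builds the graph, computes nothing yet
print(type(y).__name__)                 # Node
print(run(y, {"x": 3.0}))               # 7.0
```

Note that `y` is only a description of the computation; a number appears only once data is fed in through `run`.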

Using this model, it is easy to define the architecture of a neural network using these nodes. Each layer of a neural network can be understood as a special node in the computation graph. There are many pre-defined operations in the TensorFlow API, but users can of course define their own custom operations. But keep in mind that arbitrary computations can be defined using a computation graph, not only operations in the context of machine learning.

Graphs are invoked by a TensorFlow session: `tf.Session()`. A session can take run options as arguments, such as the number of GPUs the graph should be executed on, the specifics of memory allocation on the GPU and so on. Once the necessary data is available, it can be fed into the computation graph using the `tf.Session.run()` method, in which all the magic happens.

In order to train a neural network, using an optimization algorithm such as Stochastic Gradient Descent, we need the definitions of the gradients of all operations in the network. Otherwise, performing backpropagation on the network is not possible. Luckily, TensorFlow offers automatic differentiation for us, such that we only have to define the forward-pass of information through the network. The backward-pass of the error through all layers is inferred automatically. This feature is not unique with TensorFlow — all current ML libraries offer automatic differentiation.
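To give a feeling for what "inferred automatically" means, here is a toy reverse-mode differentiation sketch for scalars in plain Python. It is a deliberately naive illustration of the chain rule, not how TensorFlow implements it; the `Var` class is a hypothetical helper.

```python
class Var:
    """Scalar that records how it was computed so gradients can flow backward."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        """Accumulate d(output)/d(self) and pass it on to the parents (chain rule)."""
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


# Only the forward pass is written down by hand.
w, x, b = Var(2.0), Var(3.0), Var(1.0)
y = w * x + b
loss = y * y
loss.backward()
print(w.grad, x.grad, b.grad)   # 42.0 28.0 14.0
```

The recursive `backward` revisits shared sub-expressions once per path, which is fine for a toy but is exactly what real libraries avoid by traversing the graph in topological order.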

From the start, the focus of TensorFlow was to let the Computation Graph execute on GPUs. Their highly parallel architecture offers ideal performance for the extensive matrix-vector arithmetic that is necessary for training machine learning models. The NVIDIA CUDA (**C**ompute **U**nified **D**evice **A**rchitecture) API allows TensorFlow to execute **arbitrary** operations on an NVIDIA GPU.

There are also projects with the goal of exposing TensorFlow to any OpenCL-compatible device (i.e. also AMD GPUs). However, NVIDIA still remains the clear champion in Deep Learning GPU hardware, not least due to the success of CUDA+TensorFlow.

Getting a working installation of CUDA on your machine, including cuDNN and the correct NVIDIA drivers for your GPU, can be a *painful* experience (especially since not all TensorFlow versions are compatible with all CUDA/cuDNN/NVIDIA driver versions, and you may have been too lazy to look at the version compatibility pages). However, once TensorFlow can use your GPU(s), you will notice a significant boost in performance.

Large-scale machine learning tasks require access to more than one GPU in order to yield results quickly. Large enough deep neural networks have too many parameters to fit them all into a single GPU. TensorFlow lets users easily declare on which devices (GPU or CPU) the computation graph should be executed.

*Multi-GPU computation model (source: https://www.tensorflow.org/tutorials/images/deep_cnn)*

The TensorFlow Computation Graph is a powerful model for processing information. However, a major point of criticism from the start was the difficulty of debugging such graphs. With statements such as

```
import tensorflow as tf  # TensorFlow 1.x, graph mode

a = tf.constant(1.0, dtype=tf.float32)
b = tf.constant(3.0, dtype=tf.float32)
c = a + b  # a symbolic graph node, not the value 4.0
```

the content of the variable c is not 4.0, as one might expect, but rather a TensorFlow node with no definite value assigned to it yet. The validity of such a statement (and the possible bugs introduced by the statement) can only be tested after the Graph was invoked and a session was run on the Graph.

Thus, TensorFlow released the eager execution mode, in which each node is executed immediately after its definition. Statements using `tf.placeholder` are thus no longer valid. In TensorFlow 1.x, the eager execution mode is simply enabled by calling `tf.enable_eager_execution()` right after importing TensorFlow.

TensorFlow’s eager execution is an imperative programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing a computational graph to run later. The advantages of this approach are easier debugging of all computations, natural control flow using Python statements instead of graph control flow, and an intuitive interface. The downside of eager mode is the reduced performance since graph-level optimizations such as common subexpression elimination and constant-folding are no longer available.

The TensorFlow Debugger (tfdbg) lets you view the internal structure and states of running TensorFlow graphs during training and inference. Because of TensorFlow’s computation-graph paradigm, these are difficult to inspect with general-purpose debuggers such as Python’s pdb. tfdbg was conceived as an answer to criticism regarding the difficulty in debugging TensorFlow programs. There is both a command-line interface and a debugging plugin for TensorBoard (more info below) that allows you to inspect the computation graph for debugging. For a detailed introduction, see https://www.tensorflow.org/guide/debugger.

You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data such as images that pass through it during training or inference. It is definitely the way to go if you wish to visualize any kind of data that is available within the computation graph. While TensorBoard was originally introduced as part of TensorFlow, it now lives in its own GitHub repository. However, it will be installed automatically when installing TensorFlow itself.

TensorBoard is not only useful for visualizing training or evaluation data such as losses/accuracies as a function of the number of steps, but also for visualizing image data or sound waveforms. The best way to get an overview of TensorBoard is to have a look at https://www.tensorflow.org/guide/summaries_and_tensorboard.

TPUs (Tensor Processing Units) are highly-parallel computing units specifically designed to efficiently process multi-dimensional arrays (a.k.a. **Tensors**), which is particularly useful in Machine Learning. Due to their application-specific integrated circuit (ASIC) design, they are the fastest processors for machine learning applications available today. As of today, Google’s TPUs are proprietary and are not commercially available for any private consumers or businesses. They are part of the Google Compute Engine, where you can rent compute instances that have access to TPUs for your large-scale machine learning needs. Needless to say that Google aims at making every TensorFlow operation executable on a TPU device to further strengthen its position in the ever-growing cloud computing market.

You can, however, test the performance of a single TPU for yourself in Google Colab, a platform that can host and execute Jupyter Notebooks, with access to CPU/GPU or TPU instances on the Google Compute Engine, for free! For a small introduction, click here.

While neural network training typically happens on powerful hardware with sometimes multiple GPUs, neural network inference usually happens locally on consumer devices (unless the raw data is streamed to another cloud service, and inference happens there) such as the onboard computers of autonomous cars or even mobile phones. NVIDIA offers a module called TensorRT that takes a TensorFlow Graph of a trained neural network expressed using the TensorFlow API and converts it to a Computation Graph specifically optimized for inference. This usually results in a significant performance gain compared to inference within TensorFlow itself. For an introduction to TensorRT, click here.

TensorFlow has a vibrant community on GitHub that added quite some functionality to the core and the peripherals of TensorFlow (obviously a strong argument for Google to open-source TensorFlow). Most of these modules are collected in the `tf.contrib` module. Due to the high market share of TensorFlow, quite a few modules can be found here that you would otherwise have to implement yourself.

TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models. A module is a **self-contained piece** of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks in a process known as transfer learning. For more details, see https://www.tensorflow.org/hub.

There is so much more to talk about. Which components of the TensorFlow ecosystem should at least be mentioned?

- **TensorFlow Docker container**: Docker containers with TensorFlow pre-installed, including CUDA compatibility for graph execution on GPUs from within the Docker container.
- **TensorFlow Lite**: TensorFlow Lite is an open source deep learning framework for on-device inference on devices such as embedded systems and mobile phones.
- **TensorFlow Extended (TFX)**: TFX is a Google-production-scale machine learning platform based on TensorFlow. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system.

One of the strengths of TensorFlow, the Computation Graph, is arguably also one of its weaknesses. While the static computation Graph definitely boosts performance (since graph-level optimizations can happen after the graph is built and before it is executed), it also makes debugging the graph difficult and cumbersome — even with tools such as the TensorFlow Debugger. Also, benchmarks have shown that several other frameworks can compete on equal terms with TensorFlow, while keeping a simpler syntax. Additionally, first building a graph and then instantiating it using tf.Sessions is not very intuitive and definitely scares or bewilders some inexperienced users.

The TensorFlow API arguably also has weaknesses, e.g. as discussed here. Some users complain about the low-level feeling when using the TensorFlow API, even when solving a high-level task. A lot of boilerplate code is needed for simple tasks such as training a linear classifier.

After delving into the depths of TensorFlow 1.XX, what will change with the big 2? Did the TensorFlow team respond to some of the criticism of the past? And what justifies calling it version 2.0, and not 1.14?

In multiple blog posts and announcements, some of the future features of TF2.0 have been revealed. Also, the TF2.0 API reference lists have already been made publicly available. While TF2.0 is still in alpha version, it is expected that the official Beta, Release Candidates, and the final release will be made available later this year.

Let’s have a closer look at some of the **novelties** of TF2.0:

`tf`, hello `tf.keras`

For a while, TensorFlow has offered the `tf.keras` API as part of the TensorFlow module, offering the same syntax as the Keras machine learning library. Keras has received much praise for its simple and intuitive API for defining network architectures and training them. Keras integrates tightly with the rest of TensorFlow so you can access TensorFlow’s features whenever you want. The Keras API makes it easy to get started with TensorFlow. Importantly, Keras provides several model-building APIs (Sequential, Functional, and Subclassing), so you can choose the right level of abstraction for your project. TensorFlow’s implementation contains enhancements including eager execution, for immediate iteration and intuitive debugging, and `tf.data`, for building scalable input pipelines.

Training data is read using input pipelines which are created using tf.data. This will be the preferred way of declaring input pipelines. Pipelines using tf.placeholders and feed dicts for sessions will still work under the TensorFlow v1 compatibility mode, but will no longer benefit from performance improvements in subsequent tf2.0 versions.

TensorFlow 2.0 runs with eager execution (discussed previously) by default for ease of use and smooth debugging.

`tf.contrib`

Most of the modules in `tf.contrib` will be deprecated in tf2.0 and will either be moved into core TensorFlow or removed altogether.

`tf.function` decorator

The `tf.function` decorator transparently translates your Python programs into TensorFlow graphs. This process retains all the advantages of 1.x TensorFlow graph-based execution: performance optimizations, remote execution and the ability to serialize, export and deploy easily, while adding the flexibility and ease of use of expressing programs in simple Python. In my opinion, this is the biggest change and paradigm shift from v1.X to v2.0.

`tf.Session()`

When code is eagerly executed, sessions instantiating and running computation graphs will no longer be necessary. This simplifies many API calls and removes some boilerplate code from the codebase.

It will still be possible to run tf1.XX code in tf2 without any modifications, but this does not let you take advantage of many of the improvements made in TensorFlow 2.0. Instead, you can try running a conversion script that automatically converts the old tf1.XX calls to tf2 calls, if possible. The detailed migration guide from tf1 to tf2 will give you more information if needed.

I hope you liked this small overview, and see you next time!

Happy Tensorflowing!

Autoencoders are structured to take an input and transform it into a different representation, an *embedding* of the input. From this embedding, the autoencoder aims to reconstruct the original input as precisely as possible. It basically tries to copy the input. The layers of the autoencoder that create this embedding are called the **encoder**, and the layers that try to reconstruct the original input from the embedding are called the **decoder**. Usually, autoencoders are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.

More formally, an autoencoder describes a nonlinear mapping of an input \(\mathbf{x}\) into an output \(\tilde{\mathbf{x}}\) using an intermediate representation \(\mathbf{x}_{encoded} = f_{encode}(\mathbf{x})\), also called an *embedding*. The embedding is typically denoted as \(\mathbf{h}\) (h for hidden, I suppose). During training, the encoder learns a nonlinear mapping of \(\mathbf{x}\) into \(\mathbf{x}_{encoded}\). The decoder, on the other hand, learns a nonlinear mapping from \(\mathbf{x}_{encoded}\) back into the original space. The goal of training is to minimize a loss. This loss describes the objective that the autoencoder tries to reach. When our goal is merely to reconstruct the input as accurately as possible, two major types of loss function are typically used: mean squared error and Kullback-Leibler (KL) divergence.

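The **mean squared error (MSE)** is (as its name already suggests) defined as the mean of the squared difference between our network output and the ground truth. When the network output is a grid of values *a.k.a. an image*, the MSE between output image \(\bar{I}\) and ground truth image \(I\) may be defined as

\[\begin{align*} MSE = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \bar{I}_{ij} - I_{ij} \right)^2 \end{align*}\]

where \(H\) and \(W\) denote the height and width of the image.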

The notion of **KL divergence** comes originally from information theory and describes the relative entropy between two probability distributions \(p\) and \(q\). Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions.

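The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if \(p\) and \(q\) are the same distribution in the case of discrete variables, or equal *almost everywhere* in the case of continuous variables. For discrete distributions, it is defined as:

\[\begin{align*} D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \end{align*}\]

For continuous variables, the sum is replaced by an integral.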

In the context of Machine Learning, minimizing the KL divergence means to make the autoencoder sample its output from a distribution that is similar to the distribution of the input, which is a desirable property of an autoencoder.
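Both loss functions are short enough to write down directly. The following NumPy sketch uses the standard textbook definitions; the function names are my own and the KL version assumes discrete distributions given as arrays that sum to 1.

```python
import numpy as np


def mse(output, target):
    """Mean of the squared element-wise difference."""
    return float(np.mean((np.asarray(output, float) - np.asarray(target, float)) ** 2))


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    support = p > 0                       # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[support] * np.log(p[support] / q[support])))
```

Note that the KL divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, which is why it is only loosely a "distance".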

Autoencoders come in many different flavors. For the purpose of this post, we will only discuss the most important concepts and ideas for autoencoders. Most autoencoders you might encounter in the wild are *undercomplete* autoencoders. This means that the condensed representation of the input can hold less information than the input has. If your input has \(N\) dimensions, and some hidden layer of your autoencoder has only \(X < N\) dimensions, your autoencoder is undercomplete. Why would you want to hold less information in the hidden layer than your input might contain? The idea is that restricting the amount of information the encoder can put into the encoded representation forces it to focus only on the relevant and discriminative information within the input, since this allows the decoder to reconstruct the input as well as possible. Undercomplete autoencoders *boil the information down* into the most essential bits. It is a form of *dimensionality reduction*.

Now, let us discuss some flavors of autoencoders that you might encounter “in the wild”:

The most basic example of an autoencoder may be defined with an input layer, a hidden layer, and an output layer:

*A simple autoencoder (image credit: [2])*

The input layer typically has the same dimensions as the output layer since we try to reconstruct the content of the input, while the hidden layer has a smaller number of dimensions than the input or output layer.

However, depending on the purpose of the encoding scheme, it can be useful to add an additional term to the loss function that needs to be satisfied as well.

Sparse autoencoders, as their name suggests, enforce sparsity on the embedding variables. This can be achieved by means of a sparsity penalty \(\Omega(\mathbf{h})\) on the embedding layer \(\mathbf{h}\).

\[\begin{align*} loss = \mathcal{L}(f_{decode}(f_{encode}(\mathbf{x})), \mathbf{x}) + \Omega(\mathbf{h}) \end{align*}\]

The operator \(\mathcal{L}\) denotes an arbitrary distance measure (e.g. MSE or KL divergence) between input and output. The sparsity penalty may be expressed as the \(L_1\)-norm of the hidden layer activations:

\[\begin{align*} \Omega(\mathbf{h}) = \lambda \sum_i | h_i | \end{align*}\]

with a scaling parameter \(\lambda\). Enforcing sparsity is a form of regularization and can improve the generalization abilities of the autoencoder.
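As a rough sketch, the sparse loss for a tiny one-hidden-layer autoencoder could look like this in NumPy. The `tanh` encoder, the linear decoder and the choice of MSE for \(\mathcal{L}\) are assumptions made for illustration only.

```python
import numpy as np


def encode(x, W_enc):
    return np.tanh(W_enc @ x)          # embedding h


def decode(h, W_dec):
    return W_dec @ h                   # reconstruction of x


def sparse_autoencoder_loss(x, W_enc, W_dec, lam):
    """Reconstruction error (MSE) plus L1 penalty on the embedding."""
    h = encode(x, W_enc)
    reconstruction = np.mean((decode(h, W_dec) - x) ** 2)
    sparsity = lam * np.sum(np.abs(h))
    return float(reconstruction + sparsity)
```

Setting `lam` to zero recovers the plain reconstruction loss; larger values push more entries of the embedding towards zero.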

As the name suggests, a *denoising* autoencoder is able to robustly remove noise from images. How can it achieve this property? It finds feature vectors that are somewhat invariant to noise in the input (within a reasonable SNR).

A denoising autoencoder can very easily be constructed by modifying the loss function of a vanilla autoencoder. Instead of calculating the error between the original input \(\mathbf{x}\) and the reconstructed input \(\tilde{\mathbf{x}}\), we calculate the error between the original input and the reconstruction of an input \(\hat{\mathbf{x}}\) that was corrupted by some form of noise. For an MSE loss definition, this can be defined as:

\[\begin{align*} loss = \mathcal{L} \big( \mathbf{x}, f_{decode}(f_{encode}(\hat{\mathbf{x}})) \big) \end{align*}\]

Denoising autoencoders learn to undo this corruption rather than simply copying their input.
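The only change compared to a plain autoencoder is which input goes into the network and which one the output is compared against. A minimal NumPy sketch, assuming additive Gaussian noise as the corruption and a tiny `tanh`/linear autoencoder (both are my own choices for illustration):

```python
import numpy as np


def denoising_loss(x, W_enc, W_dec, noise_std, rng):
    """MSE between the clean input and the reconstruction of a corrupted copy."""
    x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)   # corrupted input
    x_rec = W_dec @ np.tanh(W_enc @ x_noisy)                 # encode + decode the noisy input
    return float(np.mean((x_rec - x) ** 2))                  # compare against the clean input
```

With `noise_std=0.0` this reduces to the ordinary reconstruction loss.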

*Denoised images (Source: [1])*

Like a sparse autoencoder, a contractive autoencoder (CAE) imposes an additional constraint on top of the reconstruction loss. For this type of autoencoder, we penalize the embedding layer by

\[\begin{align*} \Omega(\mathbf{h}) = \lambda \sum_i ||\nabla_x h_i||^2 \end{align*}\]

The operator \(\nabla\) denotes the nabla operator, meaning a gradient. Specifically, we penalize large gradients of the hidden layer activations \(h_i\) w.r.t. the input \(\mathbf{x}\). But what purpose might this constraint have?

Loosely speaking, it keeps infinitesimal changes to the input \(\mathbf{x}\) from having any influence on the embedding variables. If we make small changes to the pixel intensities of the input images, we do not want any changes to the embedding variables. The autoencoder is encouraged to map a **local neighborhood of input points** to a **smaller local neighborhood of output points**.
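For a single-layer encoder the penalty has a closed form, which makes it cheap to compute. The sketch below assumes an encoder of the form \(\mathbf{h} = \tanh(W\mathbf{x} + \mathbf{b})\); that architecture and the function name are assumptions for illustration.

```python
import numpy as np


def contractive_penalty(x, W, b, lam):
    """lam * sum_i ||grad_x h_i||^2 for the encoder h = tanh(W x + b).

    Since d h_i / d x = (1 - h_i^2) * W[i, :], the squared gradient norm of
    unit i is (1 - h_i^2)^2 * ||W[i, :]||^2.
    """
    h = np.tanh(W @ x + b)
    return float(lam * np.sum((1.0 - h ** 2) ** 2 * np.sum(W ** 2, axis=1)))
```

The sum is the squared Frobenius norm of the encoder's Jacobian, so the penalty is small exactly when the embedding barely moves under small input changes.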

And what is this useful for, you ask? The goal of the CAE is to learn the manifold structure of the data in the high-dimensional input space. For example, a CAE applied to images should learn tangent vectors that show how the image changes as objects in the image gradually change pose. This property would not be emphasised as much in a standard loss function.

Variational Autoencoders (VAEs) learn a *latent variable model* for their input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a “generative model”. [1]

In contrast to a “normal” autoencoder, a VAE turns a sample not into one parameter (the embedding representation), but into two parameters \(z_{\mu}\) and \(z_{\sigma}\) that describe the mean and the standard deviation of a latent normal distribution that is assumed to generate the data the VAE is trained on.

The parameters of the model are trained via two loss terms: a reconstruction loss forcing the decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term.
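The two VAE-specific ingredients can be sketched in a few lines of NumPy. I assume a standard normal prior here (the usual choice, but not stated above) and the common reparameterization \(z = z_{\mu} + z_{\sigma} \cdot \epsilon\) for sampling; the function names are my own.

```python
import numpy as np


def sample_latent(z_mu, z_sigma, rng):
    """Draw z ~ N(z_mu, z_sigma^2) as z_mu + z_sigma * eps with eps ~ N(0, 1)."""
    return z_mu + z_sigma * rng.standard_normal(z_mu.shape)


def kl_to_standard_normal(z_mu, z_sigma):
    """KL( N(z_mu, z_sigma^2) || N(0, 1) ), summed over the latent dimensions."""
    return float(0.5 * np.sum(z_mu ** 2 + z_sigma ** 2 - 1.0 - 2.0 * np.log(z_sigma)))
```

The total loss is then the reconstruction loss plus this KL term, which vanishes when the encoder outputs exactly the prior (\(z_{\mu} = 0\), \(z_{\sigma} = 1\)).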

For a proper introduction into VAEs, see for instance [3].

Let us now see how we can embed data in some latent dimensions. In this first experiment, we will strive for something very simple. We first create a super monotonous dataset consisting of many different images of random blocks with different heights and widths, we will call it the **block image dataset**.

Let us train a VAE with only two latent dimensions on 80000 of these block images and see what happens. I chose to use only two latent dimensions because each image can be visualized by the location of its latent embedding vector in a 2-D plane.

The figure below shows which feature vector in the 2-D plane corresponds to which block image. The block image is drawn at the location where its feature vector lies in the 2-D plane.

*Sampling the 2-D latent features on a uniform grid*

It is quite obvious that the autoencoder was able to find a mapping that makes a lot of sense for our dataset. Recall that each input datum (a single image) has \(height \cdot width \cdot channels = 28 \cdot 28 \cdot 1 = 784\) dimensions. The autoencoder was able to reduce the dimensionality of the input to only two dimensions without losing a whole lot of information since the output is visually almost indistinguishable from the input (apart from some minor artefacts). This astounding reconstruction quality is possible since each input image is so easy to describe and does not contain very much information. Each white block can be described by only two parameters: height and width. Not even the center of each block is parametrized since each block is located exactly in the center of each image.

If you want to play around with this yourself, you can find the code here. Most of the code was taken from the Keras GitHub repository.

While embedding those simple blocks might seem like a nice gimmick, let us now see how well an autoencoder actually performs on a real-world dataset: **Fashion MNIST**.
Our goal here is to find out how descriptive the embedding vectors of the input images are. Autoencoders allow us to compare visual image similarity by comparing the similarity of their respective embeddings or **features** created by the Autoencoder.

As the Fashion MNIST images are much more information-dense than the block-images from our last mini experiment, we assume that we need more latent variables in order to express the gist of each of the training images. I chose a different autoencoder architecture with 128 latent dimensions.

The idea is to create a feature vector for a query image for which we want similar results from a database. Below we can see an exemplary query image of a (very low-res) pair of jeans.

Using our autoencoder that was trained on the Fashion MNIST dataset, we want to retrieve the images corresponding to the features that are closest to the query image features in the embedding space. But how can we compare the **closeness** of two vectors? For this kind of task, one typically uses the **cosine distance** between the vectors as a distance metric.
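The retrieval step itself is simple once the embeddings exist. A minimal NumPy sketch, assuming the database embeddings are stacked row-wise in one array (the function names are hypothetical, not taken from the linked code):

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b): 0 for the same direction, 2 for opposite."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_neighbors(query, database, k=4):
    """Indices of the k rows of `database` closest to `query` in cosine distance."""
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    distances = 1.0 - db @ q
    return np.argsort(distances)[:k]
```

Cosine distance ignores the length of the feature vectors and only compares their direction, which is often what one wants for embeddings.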

Here are the four closest neighbors of the query image in feature space:

Nice! Pants are similar to pants, I guess!

Let’s try again with a new query image:

And here are the four closest neighbors:

Cool! The autoencoder was definitely able to encode the relevant information of each input image in the embedding layer.

**Note**: You should bear in mind that the autoencoder was not trained with any labels for the images. It does not “know” that these are images of shirts. It only knows that the abstract features of all of these images are roughly similar and highly descriptive of the actual image content.

If you want to play around with this yourself, the code may be found here.

We have seen how autoencoders can be constructed and what types of autoencoders have been proposed in the last couple of years.

The ability of Autoencoders to encode high-level image content information in a dense, small feature vector makes them very useful for unsupervised pretraining. We can automatically extract highly useful feature vectors from input data completely unsupervised. Later we may use these feature vectors to train an off-the-shelf classifier with these features and observe highly competitive results.

This aspect is especially useful for learning tasks where there is little labeled data but plenty of unlabeled data.

Thanks for reading and happy autoencoding! 👨💻🎉

We will use the previously discussed Concept-Math-Code (C-M-C) approach to drive our process of understanding RL.

Broadly speaking, Reinforcement Learning allows an autonomous agent to learn to make intelligent choices about how it should interact with its environment in order to maximize a reward. The environment can be as simple as a single number expressing a measurement the agent takes, or it can be as complex as a screenshot of the game DOTA2, which the agent learns to play (see https://openai.com/).

For every discrete time step, the agent perceives the state \(s\) of its environment and chooses an action \(a\) according to its policy. The agent then receives a reward \(r\) for its action and the environment transitions into the next state \(s'\).

*The feedback loop of RL (image credit: [1])*

In order to make RL work, the agent does not necessarily need to know the inner workings of the environment (i.e. it does not need a model of its environment predicting future states of the environment based on the current state). However, learning speed increases if the agent incorporates as much knowledge about the environment *a priori* as possible.

Q-Learning was a big breakthrough in the early days of Reinforcement Learning. The idea behind Q-Learning is to assign each state-action pair a value, the Q-value, which estimates the amount of reward we might get when we perform a certain action while the environment is in a certain state. So, if we are in a state S, we just pick the action that has the highest assigned Q-value, as we assume that we receive the highest reward in return. Once we have performed action a, the environment is in a new state S’ and we can measure the reward we actually received in return for performing action a. We can then update the Q-value of the state-action pair, since we now know which reward we actually received from the environment for performing action a. How the Q-values are actually updated after every time step will be discussed in the Math section of this post.

You might already have noticed a wide-open gap in the Q-Learning algorithm: How the heck are we supposed to know the Q-values of a state-action pair? We might consider updating a table in which we save the Q-values of each state-action pair. Every time we take an action in the environment, we store a new Q-value for a state-action pair. At first, these actions do not even have to make sense or lead to high rewards since we are only interested in building up a table of Q-values which we can use to make more intelligent decisions later on. But what if the state S of the environment has high dimensionality or is sampled from a continuous space? We cannot expect our computer to store infinite amounts of data. What can we do? Neural Networks to the rescue!
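To make the table idea concrete, here is a minimal sketch of a Q-table stored in a dictionary, together with a greedy action choice. The state labels, the stored values and the helper name `greedy_action` are made up for illustration. The last lines estimate how many distinct states a 40-by-40 grey-scale image with 256 intensity levels (the image size produced by the environment code in this post) could take, which shows why a table does not scale to raw pixels.

```python
from collections import defaultdict

ACTIONS = [0, 1, 2]  # e.g. move up, move down, stay

# Q-table: maps (state, action) to an estimated reward; unseen pairs default to 0.0
q_table = defaultdict(float)


def greedy_action(table, state, actions=ACTIONS):
    """Pick the action with the highest stored Q-value for the given state."""
    return max(actions, key=lambda a: table[(state, a)])


# hypothetical Q-values for a state labelled "s0"
q_table[("s0", 0)] = 0.2
q_table[("s0", 1)] = 0.7
q_table[("s0", 2)] = 0.1

best = greedy_action(q_table, "s0")
print("greedy action in s0:", best)

# number of distinct 40x40 images with 256 grey levels: 256 ** 1600 = 2 ** 12800
n_states = 256 ** (40 * 40)
print("bits needed just to index one state:", n_states.bit_length() - 1)
```

Even if only a tiny fraction of these images ever occurs in practice, a lookup table over raw pixels is clearly not an option.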

Instead of updating a possibly infinite table of state-action pairs and their respective Q-values, let’s use a Deep Neural Network to map a state-action pair to a Q-value, hence the name Deep Q-Learning.

If the network is trained sufficiently well, it is able to tell with high confidence what Q-values certain actions might have given the state S the environment is currently in. While this sounds super easy and fun, this approach suffers from instability issues and divergence. Two main mechanisms were introduced by Mnih et al. in 2015 in order to avoid these issues: Experience Replay and Frozen Optimization Target. Briefly stated, experience replay allows the network to learn more than once from a single experience \(e_t\), a tuple consisting of a state \(s_t\), an action \(a_t\), a reward \(r_t\), and a new state \(s_{t+1}\). The term **“frozen optimization target”**, in contrast, refers to the fact that the Q-value estimation network used for predicting future Q-values is not the same as the network being trained. Every \(N\) steps, the weights of the trained network are copied to the network used to predict future Q-values. It was found that this procedure leads to far fewer instability issues during training.
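Both mechanisms are simple bookkeeping, which a small sketch may make clearer. The names `ReplayMemory` and `maybe_sync_target` are my own, and plain Python lists stand in for network weights here; this is not the implementation by Mnih et al., and the agent code shown later in this post only uses the replay part.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past experiences (s, a, r, s', done)."""

    def __init__(self, capacity):
        # once the memory is full, the oldest experiences are dropped first
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling lets the same experience be reused many times
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, sync_every, online_weights, target_weights):
    """Copy the trained (online) weights into the frozen target every `sync_every` steps."""
    if step % sync_every == 0:
        target_weights[:] = online_weights  # element-wise copy, not an alias
        return True
    return False


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3)
    for t in range(5):
        memory.add(t, 0, 0.1, t + 1, False)
    print(len(memory), memory.sample(2))

    online, target = [1.0, 2.0], [0.0, 0.0]
    for step in range(1, 21):
        if maybe_sync_target(step, 10, online, target):
            print("synced target network at step", step)
```

Between two syncs the target weights stay fixed, so the value the online network is trained towards does not shift with every gradient step.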

We have established the concepts behind Q-Learning and why Deep Q-Learning solves the issue with storing possibly infinite numbers of state-action pairs. Let us now briefly dive into some of the math involved in Deep Q Learning.

During the training of the neural network, the Q-value of each state-action pair is updated using the following equation:

\[\begin{align*} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a \in A} Q(S_{t+1}, a) - Q(S_t, A_t) ] \end{align*}\]

Let us first discuss the term in square brackets. The variable \(R_{t+1}\) denotes the reward given to the agent at time step \(t+1\). The next addend denotes the maximal Q-value over all possible actions \(a\) while the environment is in the new state \(S_{t+1}\). Beware that this value can only be an estimate of the true maximal Q-value, as we can only estimate future Q-values of state-action pairs. This maximal Q-value is multiplied by a so-called discount factor \(\gamma\). This factor decides how much weight we assign to future rewards in comparison to the currently achieved reward. If the discount factor equals zero, only the current reward matters to the agent, and no future rewards enter the Q-value estimate. The discount factor is typically set to a value of about \(0.95\). The last addend in the square brackets is simply the current Q-value estimate again. Thus, the term in the square brackets as a whole expresses the *difference between our best estimate for the true Q-value and the predicted Q-value*. Bear in mind that our best estimate for the true Q-value is not perfect either, since the only thing we know for sure is the reward \(R_{t+1}\) for the next time step. Still, this little bit of knowledge helps us to improve the Q-value estimate for the current time step.

This whole difference-term in square brackets is multiplied by the learning rate alpha that weighs how much we trust this estimated error between the best estimate of the true Q-value and the predicted Q-value. The bigger we choose alpha, the more we trust this estimate. The learning rate is typically between \(0.01\) and \(0.0001\).
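The update rule can be traced with concrete numbers. The values below are made up for illustration, and the function name `q_update` is my own; `alpha` and `gamma` are the learning rate and the discount factor from the text.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.95):
    """Return the updated Q(S_t, A_t) after observing a reward and the next state."""
    best_estimate = reward + gamma * max(next_q_values)  # R + gamma * max_a Q(S', a)
    difference = best_estimate - q_sa                    # the term in square brackets
    return q_sa + alpha * difference


# current estimate 1.0, observed reward 0.1, Q-values of the next state 0.5 and 2.0
new_q = q_update(q_sa=1.0, reward=0.1, next_q_values=[0.5, 2.0])
print(new_q)  # approximately 1.1

# with gamma = 0 only the immediate reward counts
myopic_q = q_update(q_sa=1.0, reward=0.1, next_q_values=[0.5, 2.0], gamma=0.0)
print(myopic_q)  # approximately 0.91
```

In the first call the best estimate is 0.1 + 0.95 · 2.0 = 2.0, so the difference is 1.0 and the Q-value moves a tenth of the way towards it.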

With our understanding of the Q-value update, the loss of a Q-value prediction can be described as follows:

\[\begin{align*} \mathcal{L} (\theta) = E_{(s,a,r,s') \sim U(D)} \Big[ \big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^2 \Big] \end{align*}\]

Do not let yourself be intimidated by the expectation in front of the brackets on the right-hand side of the equation. It basically means that we randomly sample a state-action-reward-new_state tuple \((s,a,r,s’)\) from the replay memory \(D\). The term within the square brackets is the squared error between two quantities: the actually observed reward \(r\) plus all expected future rewards from the next time step on (discounted by gamma and estimated with the frozen parameters \(\theta^{-}\)), and the Q-value predicted by the neural network being trained. Averaged over the sampled tuples, this is a mean squared error. Pretty simple, right? During training, the prediction of the Q-values should become increasingly better (however, strong fluctuations during learning do usually happen).
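As a minimal sketch, the loss over a sampled minibatch can be computed as follows. The two Q-functions are passed in as plain callables (stand-ins for the frozen target network and the trained network), and all names and numbers are assumptions made for this example. Terminal transitions use the reward alone as target, mirroring the `if not done` check in the agent code of this post; the equation above does not spell out that case.

```python
def dqn_loss(batch, q_target, q_online, gamma=0.95):
    """Mean squared error between target values and predicted Q-values.

    batch:    list of (s, a, r, s_next, done) tuples sampled from the replay memory
    q_target: callable, state -> list of Q-values (frozen parameters theta^-)
    q_online: callable, state -> list of Q-values (trained parameters theta)
    """
    squared_errors = []
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * max(q_target(s_next))
        squared_errors.append((target - q_online(s)[a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


# hypothetical lookup tables standing in for the two networks
target_values = {"s1": [1.0, 2.0], "s2": [5.0, 5.0]}
online_values = {"s0": [2.0, 0.0], "s1": [0.0, 0.5]}

batch = [
    ("s0", 0, 0.1, "s1", False),  # target 0.1 + 0.95 * 2.0 = 2.0, prediction 2.0
    ("s1", 1, 0.0, "s2", True),   # terminal: target 0.0, prediction 0.5
]
loss = dqn_loss(batch, target_values.__getitem__, online_values.__getitem__)
print(loss)
```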

These already are the most important bits of math in Deep Q-Learning. We could of course discuss some of the beauty of Linear Algebra and Calculus involved in the Neural Network estimating the Q-values. However, this is beyond the scope of this post and smarter people have done a much better job at explaining this (e.g. the truly awesome 3Blue1Brown).

We consider an agent trying to survive in a flappy-bird-like environment. The agent has to find the hole in a moving wall coming its way in order to survive. The goal is to survive as long as possible (sorry, dear agent, but there is no happy end for you). For every time-step the agent receives a reward of 0.1. When the agent learns to maximize its reward, it consequently learns to survive as long as possible and find the holes in the moving walls.

White agent and grey moving walls (hard to see in this static image) |

The robot supports three possible moves: **{Move up, move down, stay}**; the environment code listed below exposes only the first two to the agent. The environment is rendered as a 40-by-40 grey-scale pixel grid. The agent is drawn in white, the moving walls in grey. The agent is supposed to learn optimal behavior by just looking at the raw pixels without any further knowledge about the world. This approach is the slowest to learn but also the most general, since no rules have to be hard-coded into the learning system (model-free Reinforcement Learning). During training, the network learns to map the raw pixels of the environment to the Q-values of the available actions. The policy we implemented always selects the action with the highest associated Q-value.

Without further ado, let’s have a look at the code. All relevant bits are (hopefully) commented well enough.

We could use one of the many pre-defined environments that come with an installation of the OpenAI gym Python package. However, we could also quickly write our own small environment. As already mentioned, I implemented a crappy Flappy Bird clone featuring a square-shaped robot trying to avoid the oncoming walls.

```
import gym
from gym import spaces
import numpy as np
from gym import utils
from random import randint


class Obstacle:
    """A wall moving from right to left with a hole of height 10."""

    def __init__(self):
        self.hole_top = randint(0, 30)
        self.hole_bottom = self.hole_top + 10
        self.pos_x = 40

    def reset(self):
        self.hole_top = randint(0, 30)
        self.hole_bottom = self.hole_top + 10
        self.pos_x = 40

    def step(self):
        self.pos_x -= 1  # move the wall one pixel to the left
        if self.pos_x < 0:  # reset obstacle if outside environment
            self.reset()

    def set_pos_x(self, pos_x):
        self.pos_x = pos_x

    def get_pos(self):
        return self.pos_x

    def get_hole(self):
        return self.hole_top, self.hole_bottom


class Robot:
    def __init__(self):
        self.height = 0

    def move(self, direction):
        if direction == 0:
            # move up, clamped so the robot cannot leave the image at the top
            self.height = max(0, self.height - 2)
        elif direction == 1:
            # move down, clamped so the robot cannot leave the image at the bottom
            self.height = min(40 - 5, self.height + 2)
        # direction == 2: stay where you are

    def set_height(self, height):
        self.height = height

    def get_height(self):
        return self.height

    def get_x(self):
        return 20

    def reset(self):
        self.height = randint(5, 35)


class RoadEnv(gym.Env, utils.EzPickle):
    def __init__(self):
        # needs a gym release that still ships classic_control.rendering
        from gym.envs.classic_control import rendering
        self.viewer = rendering.SimpleImageViewer()
        self._action_set = {0, 1}  # go up, go down
        self.action_space = spaces.Discrete(len(self._action_set))
        # init obstacle
        self.obstacle = Obstacle()
        # init robot
        self.robby = Robot()

    # if game is over, it resets itself
    def reset_game(self):
        self.robby.reset()
        self.obstacle.reset()

    # a single time step in the environment
    def step(self, a):
        reward, game_over = self.act(a)
        ob = self._get_obs()
        info = {}
        return ob, reward, game_over, info

    # perform action a
    def act(self, a):
        self.obstacle.step()
        self.robby.move(a)
        rob_pos_y = self.robby.get_height()
        rob_pos_x = self.robby.get_x()
        top, bottom = self.obstacle.get_hole()
        obstacle_pos_x = self.obstacle.get_pos()
        distance_x = abs(rob_pos_x - obstacle_pos_x)
        # the robot collides if it overlaps the wall horizontally
        # and is not completely inside the hole vertically
        collide_x = distance_x < 5
        collide_y = rob_pos_y < top or (rob_pos_y + 5 > bottom)
        game_over = False
        reward = 0.0
        if collide_x and collide_y:
            game_over = True
        else:
            reward = 0.1
        return reward, game_over

    @property
    def _n_actions(self):
        return len(self._action_set)

    def reset(self):
        self.reset_game()
        return self._get_obs()

    def _get_obs(self):
        img = self._get_image
        # image must be expanded along first dimension for keras
        return np.expand_dims(img, axis=0)

    def render(self):
        img = self._get_image
        # image must be expanded to 3 color channels to properly show the content
        img = np.repeat(img, 3, axis=2)
        # show frame on display
        self.viewer.imshow(img)
        return self.viewer.isopen

    @property
    def _get_image(self):
        img = np.zeros(shape=(40, 40, 1), dtype=np.uint8)
        # draw the wall in grey (128) and cut out the hole
        obstacle_x = self.obstacle.get_pos()
        width = 4
        img[:, obstacle_x:obstacle_x + width, 0] = 128
        top, bottom = self.obstacle.get_hole()
        img[top:bottom, obstacle_x:obstacle_x + width, 0] = 0
        # draw the robot in white (255)
        rob_y = self.robby.get_height()
        rob_x = self.robby.get_x()
        img[rob_y:rob_y + width, rob_x:rob_x + width, 0] = 255
        return img

    def close(self):
        if self.viewer is not None:
            self.viewer.close()
            self.viewer = None
```

```
import random
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Activation, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model
from collections import deque


class DQNAgent:
    def __init__(self, state_size, action_size, model_dir=None):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001  # not used below; the optimizer is built with 1e-6
        if model_dir:
            # loading stored model architecture and model weights
            self.load_model(model_dir)
        else:
            # creating model from scratch
            self.model = self._build_model()

    def load_model(self, model_dir):
        # minimal loader (assumption: the original loader is not shown in this post)
        self.model = load_model(model_dir)

    def _build_model(self):
        seqmodel = Sequential()
        seqmodel.add(Conv2D(32, (8, 8), strides=(4, 4), input_shape=(40, 40, 1)))
        seqmodel.add(Activation('relu'))
        seqmodel.add(Conv2D(64, (4, 4), strides=(2, 2)))
        seqmodel.add(Activation('relu'))
        seqmodel.add(Conv2D(64, (3, 3), strides=(1, 1)))
        seqmodel.add(Activation('relu'))
        seqmodel.add(Flatten())
        seqmodel.add(Dense(100))
        seqmodel.add(Activation('relu'))
        # one output per action: the predicted Q-values
        seqmodel.add(Dense(self.action_size))
        # older TensorFlow releases spell this argument `lr`
        adam = Adam(learning_rate=1e-6)
        seqmodel.compile(loss='mse', optimizer=adam)
        return seqmodel

    def remember(self, state, action, reward, next_state, done):
        # store S-A-R-S in replay memory
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        action = np.argmax(act_values[0])
        return action

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # note: the same network provides the target here (no frozen copy)
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            # do the learning
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # reduce exploration after every replay
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```

```
import time
import numpy as np
from collections import deque
from RoadEnv import RoadEnv
from DQNAgent import DQNAgent

# Initialize environment
env = RoadEnv()
# size of input image (the environment renders 40x40 pixels with one channel)
state_size = 40 * 40 * 1
# size of possible actions
action_size = env.action_space.n
# Deep-Q-Learning agent
agent = DQNAgent(state_size, action_size)
# How many time steps will be analyzed during replay?
batch_size = 32
# How many time steps should one episode contain at most?
max_steps = 500
# Total number of episodes for training
n_episodes = 20000
scores_deque = deque()
deque_length = 100
all_avg_scores = []
training = True

for e in range(n_episodes):
    state = env.reset()
    reward = 0.0  # accumulated reward of this episode
    start = time.time()
    for step in range(max_steps):
        done = False
        action = agent.act(state)
        next_state, reward_step, done, _ = env.step(action)
        reward += reward_step
        # store the reward of this single step, not the accumulated episode reward
        agent.remember(state, action, reward_step, next_state, done)
        state = next_state
        if done:
            scores_deque.append(reward)
            if len(scores_deque) > deque_length:
                scores_deque.popleft()
            scores_average = np.array(scores_deque).mean()
            all_avg_scores.append(scores_average)
            print("episode: {}/{}, #steps: {},reward: {}, e: {}, scores average = {}"
                  .format(e, n_episodes, step, reward, agent.epsilon, scores_average))
            break
        if training:
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
```

As always, you may find the complete code in the project GitHub repository.

Let us look at one episode of the agent playing without any training:

Random actions of untrained agent (sorry about the GIF artefacts) |

The agent selects actions at random as it cannot yet correlate the pixel grid with appropriate Q-values of the agent’s actions.

Pretty bad performance, I would say. How well does the agent perform after playing 20000 episodes? Let’s have a look:


We begin to see intelligent behavior. The agent steers towards the holes once it is close enough. However, even after several thousands of episodes, the agent sooner or later crashes into one of the moving walls. It might be necessary to increase the number of training episodes to 100,000.

The averaged reward plot helps us to understand the training progress of the agent.

Rolling average of rewards plotted over the episode number |

A clear upward trend of the rolling average of the reward can be made out. Strong fluctuations of rewards are a typical observation in Reinforcement Learning. Other environments may lead to even stronger fluctuations, so do not let yourself be crushed if your rewards do not seem to increase during training. Just wait a little longer!

That’s all I wanted to talk about for today. If you want to dive deeper into the topic of Reinforcement Learning, have a look at the following resources, which also greatly inspired this post:

- [1] Lilian Weng — A (Long) Peek into Reinforcement Learning
- [2] Andrej Karpathy — Deep Reinforcement Learning: Pong from Pixels
- [3] Aurélien Géron — Hands-On Machine Learning with Scikit-Learn & TensorFlow

Thanks for reading and happy learning!📈💻🎉

]]>Last week, we talked about training an image classifier on the CIFAR-10 dataset using Google Colab on a Tesla K80 GPU in the cloud. This time, we will instead carry out the classifier training on a Tensor Processing Unit (TPU).

Because training and running deep learning models can be computationally demanding, we built the Tensor Processing Unit (TPU), an ASIC designed from the ground up for machine learning that powers several of our major products, including Translate, Photos, Search, Assistant, and Gmail.

TPUs have recently been added to the Google Colab portfolio, making it even more attractive for quick-and-dirty machine learning projects when your own local processing units are just not fast enough. While the Tesla K80 available in Google Colab delivers a respectable 1.87 TFlops and has 12 GB RAM, the **TPUv2** available from within Google Colab comes with a whopping 180 TFlops, give or take. It also comes with 64 GB High Bandwidth Memory (HBM).

In order to try out the TPU on a concrete project, we will work with a Colab notebook, in which a Keras model is trained on classifying the CIFAR-10 dataset. It can be found HERE.

If you would just like to execute the TPU-compatible notebook, you can find it HERE. Otherwise, just follow the next simple steps to add TPU support to an existing notebook.

Enabling TPU support for the notebook is really straightforward. First, let’s change the runtime settings:

And choose **TPU** as the hardware accelerator:

We also have to make minor adjustments to the Python code in the notebook. We add a new cell anywhere in the notebook in which we check that the TPU devices are properly recognized in the environment:

```
import os
import pprint
import tensorflow as tf

# note: tf.Session belongs to the TensorFlow 1.x API
if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)
    with tf.Session(tpu_address) as session:
        devices = session.list_devices()
    print('TPU devices:')
    pprint.pprint(devices)
```

This should output a list of 8 TPU devices available in our Colab environment. In order to run the tf.keras model on a TPU, we have to convert it to a TPU model using the `tf.contrib.tpu.keras_to_tpu_model` function. Luckily, the function takes care of everything for us, leaving us with a couple of lines of boilerplate code.

```
# This address identifies the TPU we'll use when configuring TensorFlow.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tf.logging.set_verbosity(tf.logging.INFO)

resnet_model = tf.contrib.tpu.keras_to_tpu_model(
    resnet_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
```

In case your model is defined using the recently presented **TensorFlow Estimator API**, you only have to make some minor adjustments to your Estimator’s `model_fn` like so:

```
def model_fn(features, labels, mode, params):
    #
    # .... body of model_fn
    #
    optimizer = tf.train.AdamOptimizer()
    if FLAGS.use_tpu:
        # aggregates the gradients across the TPU shards
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    # return tf.estimator.EstimatorSpec(  # CPU or GPU estimator
    #     mode=mode,
    #     loss=loss,
    #     train_op=train_op,
    #     predictions=predictions)
    return tf.contrib.tpu.TPUEstimatorSpec(  # TPU estimator
        mode=mode,
        loss=loss,
        train_op=train_op,
        predictions=predictions)
```

You can find an example of a `TPUEstimator` in the TensorFlow GitHub repository.

You should also consider increasing the batch size for training and validation of your model. Since we have 8 TPU units available, a batch size of \(8 \times 128\) might be reasonable — depending on your model’s size. Generally speaking, a batch size of \(8 \times 8^n\), \(n\) being \(1, 2, ...\) is advised. Due to the increased batch size, you can experiment with increasing the learning rate as well, making training even faster.

Compiling the TPU model and initializing the optimizer shards takes time. Depending on the Colab environment workload, it might take a couple of minutes until the first epoch and all the necessary previous initializations have been completed. However, once the TPU model is up and running, it is *lightning fast*.

Using the Resnet model discussed in the previous post, one epoch takes approximately 25 secs compared to the approx. 7 minutes on the Tesla K80 GPU, resulting in a speedup of almost **17**.

- **Concept (C)**
- **Math (M)**
- **Code (C)**

A concept or idea of the approach is needed in order to wrap your head around the thing you are trying to understand. This concept may be a very vague yet profound statement boiling down some essential scientific findings to a single sentence or a couple of sentences at most. Some scientists even argue that the more briefly a scientific theory can be summarized, the more fundamental it is. The same goes for good explanations of concepts: The simpler the better. Or, to put it in words commonly attributed to Albert Einstein:

“If you can’t explain it simply, you don’t understand it well enough.”

Take for instance the problem of heat conduction in materials. A very abstract concept of the conduction of heat may be the following statement: *“Heat flows from hot areas to cold areas”*. Already this simple statement allows us to make meaningful predictions about the future state of a system. If you touch a hot stove, you know that your hand will sooner or later become hot. You cannot make precise predictions about the time span in which your hand will get hot or the exact temperature your hand will reach, but you have a rough idea. A more advanced concept to grasp may be the wave-particle duality, describing the alleged discrepancy between describing a particle as a wave and as a particle at the same time. Quantum mechanics allowed us to develop a new framework or concept in which a particle is both a wave and a particle at the same time. Having a good grasp of the concept of a theory is the first step in understanding this theory.

Mathematical equations describe the concept in the most abstract possible way. While the concept draws from your intuition and allows for a qualitative prediction of the future state of a system, mathematical equations allow us to precisely express quantitative properties of the system. In the context of heat conduction in materials, the 3-D heat-diffusion-equation allows us to make precise statements about the temperature u of a material as a function of space and time:

\[\begin{align*} \frac{\partial u}{\partial t} = \alpha \nabla^2 u = \alpha \Bigg( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}\Bigg) \end{align*}\]

The constant \(\alpha\) denotes the *thermal diffusivity* of the material, which is defined as the thermal conductivity \(k\) over the product of the material density \(\rho\) and the specific heat capacity \(c_p\):

\[\begin{align*} \alpha = \frac{k}{c_p \rho} \end{align*}\]

Thinking about the math behind a concept allows you to deeply dive into the problem. It allows you to set the different parameters of the system into a relationship with one another. In this example, you do not need to immediately see a possible solution to this differential equation. You might, however, understand that the spatial and temporal derivatives of the field quantity temperature are in a special relationship with one another, which is true at every point in space and at every moment in time.
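To get a feeling for the magnitude of the thermal diffusivity, here is a tiny numerical sketch. The material values for copper are approximate room-temperature textbook numbers that I assume for illustration; they do not come from the simulations in this post.

```python
def thermal_diffusivity(k, rho, c_p):
    """alpha = k / (rho * c_p); yields m^2/s when all inputs are in SI units."""
    return k / (rho * c_p)


# approximate values for copper (assumed for illustration)
k_cu = 401.0     # thermal conductivity in W/(m K)
rho_cu = 8960.0  # density in kg/m^3
c_p_cu = 385.0   # specific heat capacity in J/(kg K)

alpha_cu = thermal_diffusivity(k_cu, rho_cu, c_p_cu)
print("alpha of copper: {:.2e} m^2/s".format(alpha_cu))
```

The result is on the order of \(10^{-4}\) m²/s. A material with a higher conductivity, or with a lower density or heat capacity, evens out temperature differences faster.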

Of course, other problems might have much more complex equations and might require a much higher mental workload in order for you to hold all the complex ideas in your head. Take, again, our wave-particle duality concept: The mathematical equation describing a quantum particle, the Schrödinger equation, is a partial differential equation as well:

\[\begin{align*} i \hbar \frac{\partial \Psi}{\partial t} = - \frac{\hbar^2}{2 m} \nabla^2 \Psi + V \Psi \end{align*}\]This equation describes the behavior of the wave function \(\Psi\) of a quantum particle. While the Schrödinger equation is arguably more complex and difficult to grasp, it has the same fundamental properties as the heat equation: The time derivative of the function we try to obtain is in direct relationship to its second spatial derivative.

Pro-tip: Do not let yourself be intimidated by mathematical notation. It might sometimes look overwhelming, but notation is usually only needed to write down an obvious statement in a watertight fashion.

I lied to you in the last paragraph. You do not understand the math until you have implemented it in code. Ok, this might be an overstatement, but I heavily encourage you to implement your math in code. Otherwise you might fall for the illusion of understanding your problem without actually and thoroughly understanding it.

Writing code is useful. First of all, it helps you to play around with your problem. Mathematical equations are great for expressing complex relationships between objects. However, seeing those relationships in action in your computer adds a whole new level to your understanding of the problem. It allows you to “experience” the mathematics behind the problem. It lets you play with parameters and see how the system state changes due to these differently chosen parameters.

And secondly, writing your problem in terms of code lets you view your problem from different angles. It makes you think about the implications of possible solutions. Writing code also allows you to easily visualize your problem and the theory behind it. If you have a concise visualization of your problem, you can easily make a mental model of the system and its behavior. I deeply believe that anyone who claims to understand a scientific theory without having some kind of visualization of the problem in mind is not really telling the truth.

Let us get back to our heat equation and implement a simple program to get a solution to the equation:

To make visualization easier, let us focus on one-dimensional heat conduction problems. This boils the differential equation down to the following form:

\[\begin{align*} \frac{\partial u}{\partial t} = \frac{k}{c_p \rho} \Bigg( \frac{\partial^2 u}{\partial x^2}\Bigg) \end{align*}\]First of all, we need to specify a discrete representation of the continuous differential equation of heat conduction stated previously:

\[\begin{align*} \frac{u_i^{n+1} - u_i^{n}}{\Delta t} = \alpha \frac{u_{i+1}^{n} - 2 u_i^{n} + u_{i-1}^{n}}{\Delta x^2} \end{align*}\]

The originally continuous temperature \(u\) is discretized into discrete points in space and time. The subscripts \(i\) and the superscripts \(n\) denote the \(i\)-th spatial coordinate and \(n\)-th time step, respectively. The discrete approximation of the spatial derivative was obtained using the central differences approach, while the discrete approximation of the temporal derivative was obtained using the (forward) Euler method.

In order to find a solution, we solve the above equation for \(u_i^{n+1}\) and iterate through all time steps in an outer loop and through all spatial points in an inner loop. We also have to apply initial conditions for \(u(x, t=0)\) and boundary conditions for \(u(x=0,t)\) and \(u(x=L,t)\), where \(x=0\) and \(x=L\) denote the left and right ends of the 1-D simulation domain.
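Written out, solving the forward-Euler, central-difference discretization for the unknown \(u_i^{n+1}\) gives the explicit update that the program evaluates at every inner grid point. The abbreviation \(r\) is my own shorthand and corresponds to the factor `a*dt/(dx*dx)` in the code:

\[\begin{align*} u_i^{n+1} = u_i^{n} + r \left( u_{i+1}^{n} - 2 u_i^{n} + u_{i-1}^{n} \right), \qquad r = \frac{\alpha \Delta t}{\Delta x^2} \end{align*}\]

Every quantity on the right-hand side belongs to the already known time step \(n\), which is why no system of equations has to be solved.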

The code to obtain a solution to this equation follows. It is fairly straightforward and should be self-explanatory:

```
import os
import time

import numpy as np
import matplotlib.pyplot as plt

L = 1.0    # length of 1-D heat-conducting object
Nx = 100   # number of spatial grid points
T = 10.0   # maximum time
Nt = 1000  # number of time steps
a = 0.005  # material property alpha

x = np.linspace(0, L, Nx+1)  # mesh points in space
dx = x[1] - x[0]
t = np.linspace(0, T, Nt+1)  # mesh points in time
dt = t[1] - t[0]
u = np.zeros(Nx+1)    # unknown u at new time step
u_1 = np.zeros(Nx+1)  # u at the previous time step

# plotting boilerplate
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 1])
li, = ax.plot(x, u)
ax.relim()
ax.autoscale_view(True, True, True)
fig.canvas.draw()
plt.show(block=False)
os.makedirs('frames', exist_ok=True)  # directory for the saved frames


# definition of initial conditions
def initial(x):
    return x**2


# Set initial condition u(x,0) = initial(x)
for i in range(0, Nx+1):
    u_1[i] = initial(x[i])

# loop through every time step
for n in range(0, Nt):
    # Compute u at inner grid points
    for i in range(1, Nx):
        u[i] = u_1[i] + a*dt/(dx*dx)*(u_1[i-1] - 2*u_1[i] + u_1[i+1])
    # Apply boundary conditions
    u[0] = 1.
    u[Nx] = 0.
    # Update u_1 before next step
    u_1[:] = u
    # plot every 10 time steps
    if n % 10 == 0:
        li.set_ydata(u)
        fig.canvas.draw()
        time.sleep(0.001)
        plt.savefig('frames/' + str(n).zfill(3) + '.png')
```

The visualization of the solution to this simple differential equation should add a deeper understanding of the process of heat conduction through materials. The following three GIFs show the change in the material temperature over time. Boundary conditions of \(u(x=0,t) = 1.0\) and \(u(x=1.0,t) = 0.0\) are applied for all three setups.

Slow heat conduction |

Fast heat conduction |

Fast heat conduction with different initial conditions |

Playing around with the heat conduction parameters should make you develop a new intuition regarding the dynamical behavior of the system.

**As a side-note**: You might also encounter some numerical stability issues when selecting certain time step sizes:

Numerical divergence of solution due to time step size |

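This behavior is governed by a stability criterion closely related to the Courant–Friedrichs–Lewy condition, which states that there is an upper limit to the time step size \(\Delta t\) in explicit finite-difference schemes. The limit depends on the speed of information flow through the simulation domain and on the length interval \(\Delta x\). For the explicit scheme used here, the solution stays stable as long as \(\alpha \Delta t / \Delta x^2 \leq 1/2\).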
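Before starting a simulation, the time step can be checked against this limit. The sketch below uses the standard stability bound for the explicit (forward Euler, central difference) scheme of the 1-D heat equation, \(\alpha \Delta t / \Delta x^2 \leq 1/2\); the helper names are my own.

```python
def stability_number(a, dx, dt):
    """Dimensionless number a * dt / dx^2 of the explicit heat-equation scheme."""
    return a * dt / (dx * dx)


def is_stable(a, dx, dt, limit=0.5):
    """True if the explicit scheme is expected to remain stable."""
    return stability_number(a, dx, dt) <= limit


def max_stable_dt(a, dx, limit=0.5):
    """Largest time step that still satisfies the stability limit."""
    return limit * dx * dx / a


if __name__ == "__main__":
    a, dx = 0.005, 0.01
    print("largest stable time step:", max_stable_dt(a, dx))
    for dt in (0.005, 0.02):
        print("dt = {}: r = {:.2f}, stable: {}".format(
            dt, stability_number(a, dx, dt), is_stable(a, dx, dt)))
```

With the parameters of the script above (`a = 0.005`, `dx = 0.01`, `dt = 0.01`) the stability number is about 0.5, so that simulation runs right at the limit. Halving `dx` without adjusting `dt` would already push it into the diverging regime.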

Using the C-M-C approach, you should have developed an intuitive understanding of the whole problem. Going from an abstract concept you were able to understand the math behind the problem and were thus able to implement the equations in program code. Tinkering with parameters and visualizing your results allowed you to more deeply understand the problem and even find some numerical issues with the finite differences solution scheme.

Congrats, now you are a real problem-solver!

Let me know in the comments whether you think that the C-M-C approach can be viewed as an easy-to-follow guide to help you understand difficult problems, and which approaches you use to improve your understanding of a topic in your daily scientific struggles.

]]>Robby (big red circle) and two landmarks (smaller red circles) |

The purpose of this post is to walk you through the steps of robot localization using landmark detection and Extended Kalman Filtering.

Kalman Filtering can be understood as a way of making sense of a noisy world. When we want to determine where a robot is located, we can rely on two things: First, we know how the robot moves from time to time since we command it to move in a certain way. This is called state transitioning (i.e. how the robot moves from one state to the other). Second, we can measure the robot’s environment using its various sensors such as cameras, lidar, or sonar. The problem is that both sets of information are subject to random noise. We do not know exactly how the robot transitions from state to state since actuators are not perfect, and we cannot measure the distance to objects with infinite precision. This is where Kalman Filtering comes into play.

Kalman Filtering allows us to combine the uncertainties regarding the current state of the robot (i.e. where it is located and in which direction it is looking) and the uncertainties regarding its sensor measurements, and ideally to decrease the overall uncertainty about the robot's state. Both uncertainties are usually described by a Gaussian probability distribution, or Normal distribution. A Gaussian distribution has two parameters: mean and variance. The mean expresses which value of the distribution has the highest probability of being true, and the variance expresses how uncertain we are regarding this mean value.

The algorithm works in a two-step process. In the prediction step, the Kalman filter produces estimates of the current state variables, along with their uncertainties. Once the outcome of the next measurement (necessarily corrupted with some amount of error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with higher certainty. The algorithm is recursive. It can run in real time using only the present input measurements and the previously calculated state and its uncertainty matrix; no additional past information is required.
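The two steps are easiest to see in one dimension, where state and measurement are single numbers with Gaussian uncertainty. The following is a minimal sketch under that simplifying assumption (the function names and the numbers are made up); the localization discussed in this post works with vectors and covariance matrices instead.

```python
def predict(mean, var, motion, motion_var):
    """Prediction step: shift the estimate by the commanded motion.

    The uncertainty grows because the motion itself is noisy.
    """
    return mean + motion, var + motion_var


def update(mean, var, measurement, measurement_var):
    """Update step: variance-weighted average of prediction and measurement."""
    gain = var / (var + measurement_var)  # Kalman gain in 1-D, between 0 and 1
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var          # never larger than the predicted variance
    return new_mean, new_var


if __name__ == "__main__":
    mean, var = 0.0, 4.0  # initial belief about the position
    mean, var = predict(mean, var, motion=1.0, motion_var=1.0)
    print("after prediction:", mean, var)
    mean, var = update(mean, var, measurement=3.0, measurement_var=5.0)
    print("after update:    ", mean, var)
```

In this example, prediction and measurement are equally uncertain (variance 5 each), so the updated mean lands exactly halfway between them and the variance is cut in half.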

Since the Wikipedia image for the information flow in a Kalman Filter is so great, I cannot withhold it here:

*Kalman Filtering. Image grabbed from the Kalman filter Wikipedia page.*

I will not delve into the mathematical details of Kalman Filtering since many smart people have already done so. For a more in-depth explanation, I can recommend a stellar blog post by Tim Babb.

Extended Kalman Filtering is (as the name suggests) an extension of “Normal” Kalman Filtering. What I did not tell you in the last section is one additional assumption that was made implicitly when using Kalman Filters: The state transition model and the measurement model must be linear. From a mathematical standpoint, this means that we can use the simplicity and elegance of Linear Algebra to update the robot’s state and the robot’s measurements. In practice, this means that the state variables and measured values are assumed to change linearly over time. For instance, if we track the robot’s position in \(x\)-direction, we assume that if the robot was at position \(x_1\) at time \(t_1\), it must be at position \(x_1 + v (t_2 - t_1)\) at time \(t_2\), where \(v\) denotes the robot’s velocity in \(x\)-direction. If the robot is actually accelerating, or doing any other kind of nonlinear motion (e.g. driving around in a circle), the state transition model is slightly wrong. Under most circumstances, it is not wrong by much, but in certain edge cases, the assumption of linearity breaks down.

Assuming a linear measurement model also comes with problems. Assume you are driving along a straight road and there is a lighthouse right next to the road in front of you. While you are quite some distance away, your measurements of the distance to the lighthouse and of the angle at which it lies from your perspective change pretty much linearly (the distance decreases at roughly the speed of your car and the angle stays more or less the same). But the closer you get, and especially while you drive past it, the angle changes dramatically while the distance hardly changes at all. This is why we cannot use Linear Kalman Filtering for Robby when he is navigating his 2-D world with landmarks scattered across his 2-D plane.

**Extended Kalman Filter to the rescue!** It removes the restriction of linear state transition and measurement models. Instead, it allows you to use any kind of nonlinear function to model the state transition and the measurements you are making with your robot. In order to still be able to use the efficient and simple Linear Algebra magic in our filter, we do a trick: We linearize the models around the current robot state. This means that we assume the measurement model and the state transition model to be approximately linear around the state at which we are right now (refer to the road / lighthouse example again). After every time step, we update this linearization around the new state estimate. While this approach forces us to linearize the nonlinear functions after every time step, it turns out not to be computationally expensive.

So there you have it. Extended Kalman Filtering is basically “Normal” Kalman Filtering just with additional linearization of the now nonlinear state transition model and measurement model.
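As an illustration of what linearizing a measurement model means, here is a small sketch for the lighthouse situation. It assumes a range-and-bearing measurement of a single landmark, which may differ from the measurement model in the actual project. The names `Pose2D`, `Point2D`, `measure`, and `measurementJacobian` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Pose2D  { double x, y, phi; };  // robot position and heading
struct Point2D { double x, y; };       // landmark position

// Nonlinear measurement model h(pose): distance and bearing to a landmark.
// The bearing is not wrapped to [-pi, pi] in this sketch.
std::array<double, 2> measure(const Pose2D& p, const Point2D& lm) {
    const double dx = lm.x - p.x;
    const double dy = lm.y - p.y;
    return {std::sqrt(dx * dx + dy * dy), std::atan2(dy, dx) - p.phi};
}

// Jacobian H = dh / d(x, y, phi), evaluated at the current pose estimate.
// This matrix is the linear stand-in for h that the filter works with.
std::array<std::array<double, 3>, 2> measurementJacobian(const Pose2D& p, const Point2D& lm) {
    const double dx = lm.x - p.x;
    const double dy = lm.y - p.y;
    const double q = dx * dx + dy * dy;  // squared distance
    assert(q > 0.0);                     // robot must not sit on the landmark
    const double r = std::sqrt(q);
    std::array<std::array<double, 3>, 2> H{};
    H[0] = {-dx / r, -dy / r, 0.0};      // range row
    H[1] = {dy / q, -dx / q, -1.0};      // bearing row
    return H;
}
```

The bearing row scales with one over the squared distance, which matches the lighthouse intuition: far away the angle barely reacts to motion, close by it changes rapidly, so the linearization has to be recomputed at every step.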

In our case, where Robby is lost and wants to localize himself in this (arguably) hostile environment, Extended Kalman Filtering enables Robby to sense the landmarks and update his belief of his state accordingly. If the variances of the state estimate and the measurement estimate are low enough, Robby is very quickly very sure where he is located with respect to the landmarks, and since he knows exactly where the landmarks are, he knows where he is!

His happiness-parameter is skyrocketing!

The implementation in code is fairly straightforward. For visualization purposes, I chose the SDL2 Library for a quick-and-dirty visualization of all necessary objects. It can be downloaded here:

Following an object-oriented programming approach, I implemented the following classes:

- Class **Robot**

The Robot class’s most important members are the Pose (x position, y position, and direction) and the Velocity (linear and angular velocity). It can move forward and backward, and rotate left and right. For measuring the landmark positions, it has the method measureLandmarks, which takes the ground-truth landmarks, overlays their positions with fake measurement noise, and returns a new list of measured landmarks.

```
class Robot {
public:
    Robot(int x_start, int y_start, float orientation, int radius, SDL_Color col);
    ~Robot();

    void render(SDL_Renderer * ren);

    // Motion commands
    void move(const Uint8 * , Eigen::VectorXf & control);
    void moveForward(Eigen::VectorXf & control);
    void moveBackward(Eigen::VectorXf & control);
    void rotateLeft(Eigen::VectorXf & control);
    void rotateRight(Eigen::VectorXf & control);

    void setPose(float x, float y, float phi);
    Eigen::VectorXf get_state();

    // Overlays the ground-truth landmark positions with fake measurement
    // noise and returns a new list of measured landmarks.
    std::vector<Landmark> measureLandmarks(std::vector<Landmark> landmarks);

private:
    Pose pose;          // x position, y position, and direction
    Velocity velocity;  // linear and angular velocity
    SDL_Color color;
    int radius;
};
```
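The idea behind measureLandmarks can be sketched without SDL or Eigen. The snippet below is not the implementation from the repository: `LandmarkXY` is a hypothetical stand-in for the Landmark class, and the noise is assumed to be zero-mean Gaussian with the same standard deviation in x and y.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical, stripped-down landmark holding only a position.
struct LandmarkXY {
    float x;
    float y;
};

// Returns a noisy copy of the ground-truth landmarks. The input list is
// left untouched, so the true positions stay available for comparison.
std::vector<LandmarkXY> addMeasurementNoise(const std::vector<LandmarkXY>& ground_truth,
                                            float sigma,
                                            std::mt19937& rng) {
    assert(sigma > 0.0f);  // std::normal_distribution requires stddev > 0
    std::normal_distribution<float> noise(0.0f, sigma);
    std::vector<LandmarkXY> measured;
    measured.reserve(ground_truth.size());
    for (const LandmarkXY& lm : ground_truth) {
        measured.push_back({lm.x + noise(rng), lm.y + noise(rng)});
    }
    return measured;
}
```

Passing the random number generator in by reference keeps the function reproducible: seeding it with a fixed value yields the same "measurements" on every run, which is handy when debugging the filter.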

- Class **KalmanFilter**

The KalmanFilter class is arguably the most complex one. Its members are the matrices for state transitioning, measurements, and their respective covariances. I will gloss over most of the details here, as the code comments give some hints about the purpose of most of the code. The filtering magic happens in the localization_landmarks() member function.

```
class KalmanFilter {
public:
    /**
     * Create a Kalman filter with the specified matrices.
     * A - System dynamics matrix
     * C - Output matrix
     * Q - Process noise covariance
     * R - Measurement noise covariance
     * covariance - Estimate error covariance
     */
    KalmanFilter(
        double dt,
        const Eigen::MatrixXf& A,
        const Eigen::MatrixXf& C,
        const Eigen::MatrixXf& Q,
        const Eigen::MatrixXf& R,
        const Eigen::MatrixXf& covariance
    );

    /**
     * Initialize the filter with a guess for initial states.
     */
    void init(double t0, const Eigen::VectorXf& x0);

    /**
     * Update the estimated state based on measured values. The
     * time step is assumed to remain constant.
     */
    void update(const Eigen::VectorXf& y);

    /**
     * Return the current state estimate.
     */
    Eigen::VectorXf get_state() { return state; }

    void renderSamples(SDL_Renderer * ren);

    void localization_landmarks(const std::vector<Landmark> & observed_landmarks,
                                const std::vector<Landmark> & true_landmarks,
                                const Eigen::VectorXf & control);

private:
    // Matrices for computation
    Eigen::MatrixXf A, C, Q, R, covariance, K, P0;
    // System dimensions
    int m, n;
    // Initial and current time
    double t0, t;
    // Discrete time step
    double dt;
    // Is the filter initialized?
    bool initialized;
    // n-size identity
    Eigen::MatrixXf I;
    // Estimated states
    Eigen::VectorXf state, state_new;
};
```
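To give a feeling for what the prediction half of such a filter could look like for a robot with a pose and a linear and angular velocity, here is a sketch that uses plain arrays instead of Eigen. It assumes a simple unicycle motion model, which is a common choice but not necessarily the one used in localization_landmarks(). All names (`Vec3`, `Mat3`, `predictState`, `transitionJacobian`, `predictCovariance`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;                 // state (x, y, phi)
using Mat3 = std::array<std::array<double, 3>, 3>;  // 3x3 matrix

// Nonlinear state transition: drive with linear velocity v and angular
// velocity omega for a time step dt (assumed unicycle model).
Vec3 predictState(const Vec3& s, double v, double omega, double dt) {
    assert(dt > 0.0);
    return {s[0] + v * std::cos(s[2]) * dt,
            s[1] + v * std::sin(s[2]) * dt,
            s[2] + omega * dt};
}

// Jacobian F of the transition with respect to the state, evaluated at s.
Mat3 transitionJacobian(const Vec3& s, double v, double dt) {
    Mat3 F{};
    F[0] = {1.0, 0.0, -v * std::sin(s[2]) * dt};
    F[1] = {0.0, 1.0,  v * std::cos(s[2]) * dt};
    F[2] = {0.0, 0.0, 1.0};
    return F;
}

// Covariance prediction P' = F P F^T + Q, written out as explicit loops.
Mat3 predictCovariance(const Mat3& F, const Mat3& P, const Mat3& Q) {
    Mat3 out = Q;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                for (int l = 0; l < 3; ++l)
                    out[i][j] += F[i][k] * P[k][l] * F[j][l];
    return out;
}
```

Note how the heading uncertainty leaks into the position uncertainty through the third column of F: a robot that is unsure about its direction becomes unsure about where it ends up after driving forward.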

- Class **Landmark**

The Landmark class is the simplest of them all. It has a position, an ID (a unique color), and a method for rendering itself to the screen. That’s it.

```
class Landmark {
public:
    Landmark(float x, float y, SDL_Color id);
    ~Landmark();
    Position pos;
    SDL_Color id;  // unique color, used as the landmark's ID
    void render(SDL_Renderer * ren);
};
```

In the main function, all we do is initialize everything and run an infinite loop in which the robot’s position is updated according to keyboard input, the robot measures its environment, and the KalmanFilter performs its predict and update steps.

The full code can be found (as always) on my GitHub: https://github.com/jzuern/robot-localization

Happy Filtering! 🎉

]]>This series is divided into three parts.

**Part 1: A data-driven approach to CFD**

**Part 2: Implementation details**

**Part 3: Results** (this post)

In part 1, we gained a high-level overview of the data-driven approach to CFD and the steps that are needed to make it work. In the second part, we explored technical details of two essential steps of the data-driven approach to CFD: network architecture and accuracy measurements. In this final part, we will first discuss the number of training samples that might be needed and the most promising type of activation function for the network. Then we will see some visualizations of the network results, and finally, we will discuss the performance of the data-driven approach.

Generating a large amount of simulation data samples is a computationally demanding subtask of the data-driven approach to CFD. Due to this computational load, finding the optimal amount of training data is important. The training set must be large enough to enable the neural network to generalize to unseen geometries and avoid over-fitting its parameters, but should not be unreasonably large (data set generation would take too long) at the same time.

Final validation loss over the number of training samples

The above figure visualizes the final validation loss for different numbers of training samples after 15000 training steps. While the final validation loss is larger than 0.1 for training sets smaller than 5000 samples, it converges towards 0.02 as the number of training samples increases. Increasing the number of training samples beyond 10000 does not further reduce the final validation loss. It follows that 10000 training samples are sufficient to obtain a minimal loss on both the validation data set and the training data set.

To determine the influence of the particular choice for activation functions, four activation function types were validated:

- Exponential Linear Units (ELU)
- Concatenated Exponential Linear Units (Concat ELU)
- Rectified Linear Units (ReLU)
- Concatenated Rectified Linear Units (Concat ReLU)

The figure below shows the progression of the validation loss as a function of the number of steps during training for the four tested activation functions.

Validation loss over the number of steps for different types of activation functions

Significant spikes in validation loss appear during the early stages of training, especially with the Concat ELU and ELU activation functions. Spikes are an unavoidable consequence of mini-batch gradient descent with the Adam optimizer: some mini-batches happen to contain unfavorable data, which induces those spikes in the validation loss. Thus, smaller spikes in validation loss are present for all evaluated types of activation functions. The ReLU activation function was chosen because it showed no big spikes in validation loss during training and reached the overall smallest validation loss towards the end of training, approximately \(0.007\).

In the table below, the final losses and accuracies of the network are listed for grid resolutions of 64 x 64, 128 x 128, and 256 x 256 cells.

Accuracy table for different grid resolutions

Overall, the loss values increase with increasing grid resolution because the loss is not normalized by the grid resolution. For the 64 x 64 grid, all measured accuracies are above 93%. In particular, the divergence accuracy is very high for all tested grid resolutions. It follows that the neural network predicts physically sensible solutions to the Navier-Stokes equations (NSE) for a given voxelized obstacle geometry. The cell-based, drag-based, and max-flow-based accuracies decrease substantially with increasing grid resolution. One cause for this observation might be that more than the previously determined 10000 training samples are needed at higher grid resolutions for the neural network to achieve accuracies similar to the ones measured for the lowest grid resolution. More training data allows the neural network to better approximate the more detailed simulation data contained in training data with higher grid resolutions. Another reason for the inferior prediction accuracies at higher grid resolutions might be the network architecture. At higher grid resolutions, the flow contains more features, which the network cannot estimate with the same accuracy as the features at lower grid resolutions. Adding more residual blocks to the network might improve the prediction accuracies.
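The effect of a loss that is not normalized by the grid resolution can be illustrated with a small sketch. It assumes the loss is a sum of squared per-cell errors, which is an assumption made for illustration and not necessarily the exact loss used for training. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared per-cell errors: grows with the number of grid cells.
double summedSquaredError(const std::vector<double>& prediction,
                          const std::vector<double>& truth) {
    assert(prediction.size() == truth.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < prediction.size(); ++i) {
        const double diff = prediction[i] - truth[i];
        sum += diff * diff;
    }
    return sum;
}

// The same error divided by the number of cells: comparable across grids.
double meanSquaredError(const std::vector<double>& prediction,
                        const std::vector<double>& truth) {
    assert(!prediction.empty());
    return summedSquaredError(prediction, truth) / static_cast<double>(prediction.size());
}
```

With an identical error in every cell, going from 64 x 64 to 128 x 128 cells quadruples the summed error while the mean stays the same, so loss values of different grid resolutions should not be compared directly.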

The two images below show the ground truth vector field (obtained with a simulation) and prediction of the neural net (obtained with the data-driven approach to CFD). But which one is which?

Ground truth vector field and prediction of the neural net

Visually, the vector fields for the simulation and the data-driven prediction do not differ substantially from each other for all grid resolutions. The absolute error for each cell (shown in the figure below) illustrates the regions of the simulation domain where large differences between simulation and prediction are observed. The line along the front (where the fluid flow hits the object) has the highest mismatch between simulation and prediction for all examined resolutions. This behavior can be explained by the high velocity gradient along the obstacle front, as the fluid is slowed down from free-flow velocity to zero. Neural networks tend to have difficulties where big gradients prevail. High absolute error also occurs at the border of the slipstream behind the obstacle, where the flow velocity in x-direction changes sharply along the y-axis. This holds for all grid resolutions equally. Additional prediction error is introduced within the obstacle itself. Here, the fluid flow velocity is zero by definition, but the neural network is not completely able to predict it to be zero. However, this error is smaller than the errors introduced in areas of high velocity gradients.

Absolute error between ground truth and prediction
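A per-cell absolute error map like the one in the figure could be computed along the following lines. This is a sketch under assumptions: the velocity field is stored as a flat list of two-dimensional velocity vectors, and the error of a cell is taken to be the magnitude of the difference vector. The names `Velocity2D` and `absoluteError` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Velocity vector stored in one grid cell (hypothetical type).
struct Velocity2D {
    double u;  // x-component
    double v;  // y-component
};

// Per-cell absolute error: length of the difference between the simulated
// and the predicted velocity vector.
std::vector<double> absoluteError(const std::vector<Velocity2D>& simulation,
                                  const std::vector<Velocity2D>& prediction) {
    assert(simulation.size() == prediction.size());
    std::vector<double> error(simulation.size());
    for (std::size_t i = 0; i < simulation.size(); ++i) {
        error[i] = std::hypot(simulation[i].u - prediction[i].u,
                              simulation[i].v - prediction[i].v);
    }
    return error;
}
```

Cells inside the obstacle are a useful sanity check for such a map: the simulated velocity there is zero, so the error is simply the magnitude of whatever the network predicts.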

Block-shaped prediction artifacts are dominant in flow regions close to the outlet outside the slipstream behind the obstacle. These artifacts might be caused by the convolution filters in the neural network as the blocks might correspond to the visual input region of one convolutional layer into the next layer. An insufficient number of training samples or insufficient convergence of the neural network weights during training might be causes of the artifacts. Low prediction error is dominant in flow regions close to the inlet and in the slipstream of the obstacle. This can be explained by the fact that these are regions of uniform flow velocity. Close to the inlet, the flow velocity is very close to the inlet boundary condition velocity and behind the obstacle in the slipstream the flow velocity is approximately zero. These simple flow characteristics are easy for the neural network to learn.

To evaluate the advantages of the data-driven approach to CFD, not only the accuracy of the approach in relation to state-of-the-art CFD solvers must be determined, but also the time it takes to derive a solution to a posed problem.

Creating the 10000 samples of the external two-dimensional simulation data set takes approximately **13 hours** on the machine used. In addition to the data set creation time, the neural net training duration must be considered. The table below lists the training durations for the different grid resolutions, the number of simulation predictions per second once the neural net was trained on the simulation data set, and the speedup in comparison to the OpenFOAM simulation.


Creating the minimum number of 10000 samples for the simulation data sets takes a considerable amount of time, especially for the three-dimensional simulations. The training process of the neural network is equally demanding in terms of required computational power and takes a similar amount of time until the neural network is sufficiently trained on the data set. It can be concluded that creating the necessary data sets with enough samples constitutes a substantial computational overhead for the data-driven approach to CFD.

However, once the neural network is trained on the simulation data set, a vast speedup in comparison to the traditional simulation-based approach to CFD can be observed. Due to the smaller number of input grid cells, the grids with the lowest resolution yield the largest speedups. Even the smallest speedup of about 57, which was achieved for the 256 x 256 grid, makes the approach more than one order of magnitude faster than the state-of-the-art SimpleFoam CFD solver. Even at the lowest prediction rate, the data-driven approach to CFD is fast enough for real-time flow field predictions in all tested simulation setups.

The aim of the proposed data-driven approach to CFD is to outperform traditional simulation-driven CFD in simulation setups the neural network was previously trained on. While there is substantial computational overhead in creating training samples and training the neural network on them, these calculations can be performed offline without user interaction and thus do not introduce any waiting time for engineers. Compared to existing approximation models in the domain of CFD, neural networks enable an efficient estimation of the entire velocity field. Furthermore, designers and engineers can directly apply the CNN approximation model in their design space exploration algorithms without training extra lower-dimensional surrogate models.

The flow prediction results show that with an increased grid resolution, the overall accuracy of the predictions deteriorates. The reason for this unwanted behavior might be found in the lack of training samples and in the network design. While the number of training samples and the network architecture are sufficient to train the proposed neural network for low-res grid sizes and two-dimensional fluid data, this setup is insufficient for higher grid sizes with more fluid flow information that is encoded in the grid cells. More training samples provide a wider range of flow fields to learn from and a deeper or wider network architecture with more parameters promises to find more detailed flow patterns in the samples and thus potentially increases the prediction accuracy.

The data-driven approach can provide immediate feedback for real-time design iterations at the early stages of design exploration. Immediate feedback allows the engineer or designer to explore designs during the creative design process without interrupting the creative process. Inference times of well below 0.01 seconds for the highest tested grid resolution are below the latency threshold of the human brain.

The data-driven approach to CFD is not intended as a replacement for existing CFD software. However, it can be understood as a tool for the first step of the product development process. In a second step, high-performance simulation-driven CFD can be employed to further refine the results of the data-driven approach in order to guarantee quantitative product simulation data with only a small margin of error. The aim of Machine Learning based approaches to CFD might never be to fully replace physics-based numerical solvers, but to serve as a good-enough approximation method for fluid flow behavior where time and computational resources for a high-resolution CFD simulation are scarce. While the accuracies of the data-driven approach to CFD can never reach the accuracies of numerical CFD solvers, suitable applications for data-driven approaches might include computer games, where high accuracy is not needed but very fast execution is.

This part 3 marks the end of the three-part series on Neural Networks for Steady-State fluid flow prediction. I hope you enjoyed the journey as much as I did.

Thanks for reading! 👨💻 🎉

]]>