1 Introduction

Humans interpret the environment by processing information contained in the visible light radiated, reflected, or transmitted by surrounding objects. Computer vision algorithms try to perceive images coming from sensors. Thanks to larger, higher-resolution screens and the necessary built-in sensors, such as a global navigation satellite system (GNSS) receiver, a digital magnetic compass (DMC), an accelerometer, and a gyroscope, smart devices have become suitable for navigation. Besides GNSS, the earth’s magnetic field can be used to obtain a rough estimate of the position and orientation of the observer; however, the precision of mobile sensors is not high enough for augmented reality (AR) applications. The compass can be biased by nearby metal and electric instruments despite frequent calibration, so measuring magnetic north is not reliable. Several studies, for example (Blum et al. 2013; Hölzl et al. 2013), have examined sensor reliability in real-world tests and showed that the error of the DMC can be as high as \(10^\circ \)–\(30^\circ \). Although the errors of the gyroscope and accelerometer also grow with elapsed time, and the GNSS position can be off by several meters, these are not that critical from the perspective of this research.

Visual localization is a six-dimensional problem of finding the position (longitude, latitude, elevation) and orientation (pan, tilt, roll) of the observer from a single geotagged photo. Visual orientation from an image requires that the position of the observer is at least roughly given, that the photo is taken not far from the ground, and that the camera is approximately horizontal. Thus, the problem can be reduced to a one-dimensional instance in which only the pan angle, in other words the azimuth, needs to be determined. Computer vision can help to improve the precision of the sensors by capturing visual clues whose real-world positions are accurately known. This study proposes a method that extracts the skyline from an image and matches it in real time with the panoramic, or synthetic, skyline extracted from a rendered digital elevation model (DEM). Thus, the orientation of the observer can be improved, which is critical in AR applications.

In this paper, the focus is on mobile mountaineering apps that annotate mountain photos by matching images with 3D terrain models and geographic data. Nowadays, the ideal hiking app should have the following key features: rendered 3D terrain models, highly detailed spatial data, and an AR mode with automatic orientation. Popular AR apps such as PeakVisor and PeakFinder AR have well-developed mountain identification functions. Some can render the digital terrain model and label nearby peaks with their names and additional information. In some cases, uploaded images can be annotated as well. However, the horizontal orientation is usually imprecise; thus, fine-tuning by the user is required for a perfect result. One of the few applications that employs sophisticated artificial intelligence algorithms is PeakLens, but it focuses solely on this function. The forthcoming, fully panoramic \(360^\circ \) version of this app by La Salandra et al. (2019) can be used with Virtual Reality (VR) devices, too. Lütjens et al. (2019) give a good example of how VR can offer intuitive 3D terrain visualization of geographical data.

Fig. 1 AR application for peak identification

The main contribution of this study is a novel edge-based procedure for automatic skyline extraction and a real-time method that increases the accuracy of the azimuth for a future AR application, whose operation is demonstrated in Fig. 1. An original photo taken by the camera is shown in Fig. 1a; Fig. 1b introduces the DEM with the pertinent geographical data; the fusion of the image and the information of interest can be seen in Fig. 1c. There are three main steps in the present approach:

  1. Panoramic skyline determination from the DEM.

  2. Skyline extraction from the image.

  3. Matching the two skylines.

The rest of the paper is organized as follows: Sect. 2 overviews relevant works in this field; Sect. 3 describes the proposed method; Sect. 4 presents the experimental results. Finally, conclusions and outlook are drawn in Sect. 5.

2 Related Work

In recent years, there has been considerable interest in the challenging task of visual localization in mountainous terrain. In natural scenes, vegetation changes rapidly, as do lighting and weather conditions. The most stable and informative feature is the contour of the mountains, i.e., the skyline, so it can be used for orientation.

Many experts examine the so-called drop-off problem, in which the observer or an Unmanned Aerial Vehicle (UAV) is dropped off in an unfamiliar environment and tries to determine its position. Preliminary work by Stein and Medioni (1995) focuses primarily on pre-computed panoramic skyline matching with manually extracted skylines. Tzeng et al. (2013) investigate a user-aided visual localization method in the desert using a DEM. Once the user marks the skyline in the query image manually, this feature is looked up in a database of panoramic skylines rendered from the DEM. Camera pose and orientation estimation from an image and a DEM were studied by Naval et al. (1997). This non-real-time approach classifies the sky and non-sky pixels by a previously trained neural network. Peaks and peak-like protrusions are used as feature points in the matching phase, where pre-calculated synthetic skylines are stored in a database, which is not favourable in a real-time AR app due to the computation and storage needs.

Fedorov et al. (2016) propose a framework for an outdoor AR application for mountain peak detection called SnowWatch and describe its data management approach. Sensor inaccuracy and position alignment are partially discussed in their paper. In contrast to the present study, they also take the device orientation as input, and they reached a slightly higher peak position error (\(1.32^\circ \)) on their manually annotated dataset. SwissPeaks, presented by Karpischek et al. (2009), is another AR app that overlays peak information. The main limitation of the app is that the correct azimuth must be set manually, since visual feature extraction and matching were not implemented. Lie et al. (2005) examine skyline extraction by a dynamic programming algorithm that looks for the shortest path on the edge map, based on the assumption that the shortest path between image boundaries is the skyline. A similar solution is investigated by Hung et al. (2013), where a support vector machine is trained to classify skyline and non-skyline edge segments. A comparison of four autonomous skyline segmentation techniques that use machine learning is reviewed by Ahmad et al. (2017). The above-mentioned studies focus only on skyline extraction, and their outcomes are hard to compare with the results of this paper.

A non-real-time procedure for visual localization is suggested by Saurer et al. (2016). They introduce an approach for large-scale visual localization by extracting the skyline from query images and using a collection of pre-generated, vector-quantized panoramic skylines determined at regular grid positions. For sky segmentation they use dynamic programming, but their solution requires manual interaction by the operator for challenging pictures, which amounted to \(40\%\) of the samples. An early attempt was made by Behringer (1999) to use computer vision methods for improving orientation precision. Due to its computational complexity, this solution was tested in non-real-time. Baboud et al. (2011) also present an automatic, but non-real-time, solution for visual orientation with the aim of annotating and augmenting mountain pictures. From the geographical coordinates and the camera field of view (FOV), this system automatically determines the pose of the camera relative to the terrain model by using contours extracted from the 3D model. They use an edge-based algorithm for skyline detection, and they propose a novel metric for fine-matching based on the feasible topology of silhouette maps. However, the algorithm is computationally sophisticated and therefore not suitable for AR applications. An unsupervised method for peak identification in geotagged photos is examined by Fedorov et al. (2014). They extract the panoramic skyline by edge detection from the rendered DEM, but they do not address exactly how to obtain the skyline from an image.

It is worth noting that infra-red cameras have also been applied to localization in mountainous areas, see, e.g., Woo et al. (2007), who designed a procedure for UAV navigation based on peak extraction. Special sensors that are sensitive in the IR range could work better in bad weather or weak light conditions. Unfortunately, a real-world test is not presented in their study.

Visual localization in an urban environment is a related problem. Several studies have been carried out on visual-aided localization and navigation in cities, where the sky region is more homogeneous than other parts of the image. For instance, Ramalingam et al. (2010) employ skylines and 3D city models for geolocalization in GNSS-challenged urban canyons. Zhu et al. (2012) match the panoramic skyline extracted from a 3D city model with a partial skyline from an image.

3 Method

Fig. 2 The determination of the panoramic skyline

The proposed method consists of three main stages. The first stage is to determine the panoramic skyline from the DEM by a geometric transformation suggested by Zhu et al. (2012). After that, the skyline is extracted from the image. Finally, the matching is carried out by maximizing the correlation between the two skyline vectors. C++ and OpenSceneGraph were used for the panoramic skyline determination. The image processing and matching tasks were carried out in MATLAB (Image Processing Toolbox). Finally, georeferencing was done with the help of Google Earth Pro and QGIS.

3.1 Panoramic Skyline Determination

The panoramic skyline is a vector obtained from the 3D model of the terrain. In this research, publicly available DEMs (SRTM and ASTER) were used, sampled at spatial resolutions between 30 m and 90 m. Depending on the distance of the viewpoint from the target and the properties of the terrain in the corresponding geographical area, this can be a bit coarse, but in most cases this resolution was satisfactory. Figure 2a shows a rendered DEM, where the black triangle is the position of the camera, determined by GNSS. The \(360^{\circ }\) panoramic skyline was calculated from this point by a coordinate transformation, as Fig. 2b shows, where

  • \(C(X_0,Y_0,Z_0)\) is the position of the camera,

  • \(D(X,Y,Z)\) is an arbitrary point of the DEM,

  • \(D^{\prime }(x^{\prime },y^{\prime },z^{\prime })\) is the projection of point D.

Hereby, each point can be described by the azimuth angle:

$$\begin{aligned} \varphi = {\left\{ \begin{array}{ll} 0, &{} \text {if } X = X_0 \text { and } Z = Z_0 \\ \arcsin \left( \frac{z^{\prime }-Z_0}{\rho }\right) , &{} \text {if } X \ge X_0 \\ -\arcsin \left( \frac{z^{\prime }-Z_0}{\rho }\right) + \pi , &{} \text {if } X < X_0 \end{array}\right. } \end{aligned}$$

and the elevation angle:

$$\begin{aligned} \theta = \arcsin \left( \frac{Y-y^{\prime }}{r}\right) , \end{aligned}$$

where

$$\begin{aligned} \rho = \sqrt{(x^{\prime }-X_0)^2 + (z^{\prime }-Z_0)^2} \end{aligned}$$

is the distance between C and \(D^{\prime }\) and

$$\begin{aligned} r = \sqrt{(X-X_0)^2 + (Y-Y_0)^2 + (Z-Z_0)^2} \end{aligned}$$

is the distance between C and D. A 3D-to-2D transformation was applied, since neither the height information nor the radial distance is required any longer. The azimuth angle \(\varphi \) and the elevation angle \(\theta \) describe any point D of the DEM. Finally, the greatest \(\theta \) value determines the skyline point for each \(\varphi \). Figure 2c illustrates the panoramic skyline projected on a satellite image. The sharp edges in the left corner indicate the border of the DEMs, because the skyline was calculated only up to a reasonable distance. Figure 2d shows the panoramic skyline vector that will be used in the matching stage.
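
A minimal MATLAB sketch of this transformation is given below. It assumes the DEM points and the camera C are in a common metric frame, and that the projection \(D^{\prime }\) preserves the horizontal coordinates and lies at the camera height (i.e., \(x^{\prime }=X\), \(z^{\prime }=Z\), \(y^{\prime }=Y_0\)); variable names and the \(0.1^\circ \) sampling are illustrative assumptions, not the original implementation.

```matlab
% Azimuth and elevation of every DEM point as seen from the camera C.
% X, Y, Z    : column vectors of DEM point coordinates (Y is the height)
% X0, Y0, Z0 : camera position C (from GNSS)
dx  = X - X0;
dz  = Z - Z0;
rho = hypot(dx, dz);                          % horizontal distance C-D'
r   = sqrt(dx.^2 + (Y - Y0).^2 + dz.^2);      % 3-D distance C-D
% atan2 reproduces the two arcsin branches of the azimuth formula in one call;
% an extra offset may be needed to align phi = 0 with geographic north.
phi = mod(atan2(dz, dx), 2*pi);
phi(rho == 0) = 0;                            % points directly above/below C
theta = asin((Y - Y0) ./ r);                  % elevation angle

% Panoramic skyline: greatest elevation per azimuth bin (0.1-degree sampling assumed).
edges = deg2rad(0:0.1:360);
[~, ~, bin] = histcounts(phi, edges);
panSkyline = accumarray(bin(bin > 0), theta(bin > 0), ...
                        [numel(edges) - 1, 1], @max, NaN);
```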

3.2 Skyline Extraction

Fig. 3 Extraction of the skyline

The skyline sharply demarcates terrain from the sky on a landscape photo. An automatic edge-based method is presented in this study for skyline extraction. The idea is based on the experience that large and wide connected components in the upper region of the image usually belong to the skyline (Fig. 3).

In the feature extraction step, connected components labeling was used, which is a well-known algorithm for finding blobs in a binary image and assigning a unique label to all pixels of each connected component. Figure 4a shows an input binary image with disjoint edge segments that are coloured in different shades of grey in the output, see Fig. 4b. A flood-fill algorithm was applied for finding 8-connected components, i.e., pixels touching at edges or corners. A detailed review of connected components labeling can be found in He et al. (2017). It is not necessary to detect the whole skyline since, in most cases, recognizing only an essential part of it is enough for matching. On the other hand, it is crucial to extract a piece of the real skyline and not a false edge.
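
As an illustration, the labeling step could be performed in MATLAB roughly as follows (edgeMap stands for the binary Canny edge map produced by the preprocessing described below; the names are illustrative, not from the original implementation):

```matlab
% 8-connected component labeling of the binary edge map (flood-fill based).
CC = bwconncomp(edgeMap, 8);            % list of connected components
L  = labelmatrix(CC);                   % integer label image
imshow(label2rgb(L, 'gray', 'k'));      % each blob in a different shade of grey (cf. Fig. 4b)
```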

In the preprocessing step, morphological operations were carried out to enhance the greyscale image and remove noise. Morphological closing (dilation followed by erosion) eliminates small holes, while morphological opening (erosion followed by dilation) removes objects from the foreground that are smaller than the structuring element. A disk-shaped structuring element was used for both closing and opening, but with different radii (5 and 10 pixels). Details on morphology can be found in Szeliski (2011).
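
A short sketch of this preprocessing step is shown below. The text does not specify which radius belongs to which operation, so the assignment here is an assumption; blueChannel denotes the blue colour channel used as the greyscale image (see step 1b in the list below).

```matlab
% Morphological smoothing of the greyscale image before edge detection.
seClose = strel('disk', 5);             % disk radius assumed for closing
seOpen  = strel('disk', 10);            % disk radius assumed for opening
G = imclose(blueChannel, seClose);      % closing: fill small dark holes
G = imopen(G, seOpen);                  % opening: remove small bright objects
```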

The algorithm selects the skyline from skyline candidates in multiple steps. The candidates (C) were sorted by the function \(S(C) = \mu (C) + 2\rho (C)\), where \(\mu \) is the number of pixels in the candidate and \(\rho \) is the span of the candidate, i.e., the difference between the rightmost and the leftmost pixel column in the image. Based on the experiments, this function, which takes into account the size of C and its span with double weight, proved to be the most effective. Therefore, larger and broader skyline candidates are preferred.
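
The candidate score could be computed as follows, reusing the labeling result CC from the sketch above (illustrative code, not the original implementation):

```matlab
% Score S(C) = mu(C) + 2*rho(C): pixel count plus twice the horizontal span.
stats  = regionprops(CC, 'Area', 'PixelList');
scores = zeros(CC.NumObjects, 1);
for i = 1:CC.NumObjects
    cols      = stats(i).PixelList(:, 1);      % x (column) coordinates of the blob
    scores(i) = stats(i).Area + 2 * (max(cols) - min(cols));
end
[~, order] = sort(scores, 'descend');
topIdx = order(1:min(3, numel(order)));        % the top three skyline candidates
```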

The main steps of the approach are listed below and also illustrated in Fig. 3; a code sketch of the post-labeling steps follows the list.

  1. Preprocessing

     (a) The original image is resized to \(640 \times 480\) pixels and the contrast is adjusted (Fig. 3a).

     (b) Based on observations, the sky is in the sharpest contrast to the terrain in the blue channel of the RGB colour space; thus, the blue channel is used as the greyscale image.

     (c) Morphological closing and opening operations are applied to smooth the outlines, reduce noise, and thereby suppress useless details, e.g., edges of tree branches or rocks (Fig. 3b).

     (d) Edge detection is carried out by the Canny edge detector, resulting in a bitmap that contains the most distinctive edges in the image (Fig. 3c).

  2. Connected components labeling detects the connected pixels on the edge map, determining the skyline candidates. The top three skyline candidates are chosen by the evaluation function S (Fig. 3d).

  3. A top-down search selects the first edge pixels of the most probable candidates in each column, because the skyline should be in the upper region of the image (Fig. 3e).

  4. Since the previous steps might leave holes in the real skyline, a bridge operation fills the one-pixel gaps.

  5. A second connected component analysis eliminates the left-over pieces from the edge map and selects the largest one as the presumed skyline (Fig. 3f).

  6. Finally, the skyline is vectorized in order to make matching more efficient (Fig. 3g).
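
A minimal MATLAB sketch of steps 3–6 is given below, assuming candMask is a binary mask containing the pixels of the top three candidates; the names and details are illustrative assumptions, not the original code.

```matlab
[h, w] = size(candMask);
skyMask = false(h, w);
for col = 1:w                               % step 3: first candidate pixel in each column
    row = find(candMask(:, col), 1, 'first');
    if ~isempty(row)
        skyMask(row, col) = true;
    end
end
skyMask = bwmorph(skyMask, 'bridge');       % step 4: fill one-pixel gaps
CC2 = bwconncomp(skyMask, 8);               % step 5: keep only the largest piece
[~, idx] = max(cellfun(@numel, CC2.PixelIdxList));
skyMask = false(h, w);
skyMask(CC2.PixelIdxList{idx}) = true;
imgSkyline = nan(1, w);                     % step 6: vectorize (one row index per column)
for col = 1:w
    row = find(skyMask(:, col), 1, 'first');
    if ~isempty(row), imgSkyline(col) = row; end
end
```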

Fig. 4 Connected components labeling

3.3 Skyline Matching

The last stage of the proposed method is matching the panoramic skyline and the recognized fragment of the skyline from the image. The point where the skyline vectors interlock, i.e., where the image skyline fits into the panoramic skyline, is sought; from this point \(\varphi \) can be obtained. For a proper comparison, the Horizontal Field of View (HFOV) of the camera and the panoramic skyline need to be synchronized via the sampling rate of the two signals. For the sake of simplicity, the first index of the panoramic skyline vector corresponds to \(0^\circ \) (north) as a reference point. In the case of a partially extracted image skyline, the gaps should also be considered in accordance with the HFOV, i.e., the total width of the skyline is estimated.

Then, normalized cross-correlation \((a \star b)\) was used which is often applied in signal processing tasks as a measure of similarity between a vector a (panoramic skyline) and shifted (lagged) copies of a vector b (extracted skyline) as a function of the lag k. After calculating the cross-correlation between the two vectors, the maximum of the cross-correlation function indicates the point K where the signals are best aligned:

$$\begin{aligned} K = \mathop {{{\,\mathrm{argmax}\,}}}\limits _{0^\circ \le k < 360^\circ } ((a \star b)(k)). \end{aligned}$$

From K, the azimuth angle \(\varphi \) can be determined, and the estimated horizontal orientation can be acquired. As mentioned above, the camera is assumed to be approximately horizontal when the picture is taken, though the skyline could be slightly slanted. Cross-correlation proved to be insensitive to this kind of inaccuracy; thus, this approach is appropriate for matching the skylines. An example of matching the two skylines is presented in Fig. 5.
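
A minimal MATLAB sketch of the matching stage follows. It assumes the panoramic skyline a (index 1 corresponding to north) and the extracted skyline b are row vectors already resampled to the same rate of sampPerDeg samples per degree, with the gaps in b handled as described above; sampPerDeg and HFOV are illustrative variables, and the z-score normalization is a simple stand-in for normalized cross-correlation.

```matlab
% Circular cross-correlation between the panoramic skyline a and the image skyline b.
a = (a - mean(a)) / std(a);                 % z-normalize both signals
b = (b - mean(b)) / std(b);
n  = numel(a);
aa = [a a];                                 % wrap around 360 degrees
c  = zeros(1, n);
for k = 0:n-1                               % lag in samples
    seg    = aa(k + 1 : k + numel(b));
    c(k+1) = sum(seg .* b) / numel(b);
end
[~, K]  = max(c);                           % best-aligned lag (Eq. above)
phiLeft = (K - 1) / sampPerDeg;             % azimuth of the left edge of the image skyline
phiAxis = phiLeft + HFOV / 2;               % approximate viewing direction of the camera
```

For long vectors, the explicit loop could be replaced by an FFT-based circular correlation; the loop is kept here only for clarity.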

Fig. 5 Matching of the panoramic skyline (blue curve) and the skyline extracted from the image (red curve)

4 Experimental Results

The goal of this study was to develop a procedure that can determine the exact orientation of the observer in a mountainous environment from a geotagged camera picture and a DEM. The main contribution of this paper is an edge-based skyline extraction method; thus, the first part of this section demonstrates the results on sample images. The second part is about calculating \(\varphi \) and comparing the results with the ground-truth azimuth angles (\({\hat{\varphi }}\)) determined by traditional cartographic methods using reference objects in the image.

4.1 Skyline Extraction

Skyline extraction is a crucial task in this method. The whole pattern is not necessarily needed for the correct alignment; in most cases, only a characteristic part of the skyline is enough for the orientation. The algorithm was tested on a sample set that contains mountain photos from various locations and seasons, under different weather and light conditions. The goal was to extract the skyline feature as precisely as possible and classify the outputs. The pictures were taken by the author or downloaded from Flickr under the appropriate Creative Commons license. The collection consists of 150 images with \(640 \times 480\) pixels resolution and 24-bit colour depth. Experiments showed that this resolution provides suitable results considering computation performance as well. Figure 6 illustrates the extraction steps on four different instances. For details on the steps, see Sect. 3.

Fig. 6 Various successful examples of automatic skyline extraction: a shows a craggy mountain ridge with clouds and rocks that could mislead an edge detector; in b the snowy hills blend into the cloudy sky, which makes skyline detection difficult; c is taken from behind a blurry window, where raindrops and occluding tree branches could impede the operation of the algorithm; d demonstrates a high-contrast image with a clear skyline, although clouds might induce false skyline edges

The outputs were grouped into four classes according to the quality (%) of the result. The evaluation was done manually because both type I and type II errors can occur, and an objective measure is difficult to create.

  • Perfect: [95–100%]; the whole skyline is detected, and no interfering fragments found.

  • Good: [50–95%]; the better part of the skyline is detected, and possible errors do not affect the analyses.

  • Poor: [5–50%]; only a small part of the skyline is detected, and possible errors might affect the analyses.

  • Bad: [0–5%]; the detected edges do not belong to the skyline.

Table 1 shows that the extracted skylines were assigned to the perfect or good classes in more than \(89\%\) of the samples. In these cases, the extracted features are suitable for matching in the next phase of the algorithm. It is noteworthy that the rate of poor outcomes is \(8\%\) and that of bad outcomes is less than \(3\%\). When the algorithm fails, the difficulties usually arise from occlusion, foggy weather, or low light conditions. Sometimes, in high-contrast pictures with plenty of edges, e.g., deceptive clouds or rocks, the largest connected component does not necessarily belong to the skyline, and it is difficult to find the horizon line even with the naked eye.

Table 1 Results of automatic skyline extraction method

4.2 Field Tests

Unfortunately, it was not possible to compare the results directly with those obtained by the other algorithms discussed earlier, due to the different problems they addressed. Therefore, field tests were carried out by the author to measure the performance of the method. The experiments aimed to determine the orientation using only a geotagged photo and the DEM. A Microsoft Surface 3 tablet was employed, which has a built-in GNSS sensor and an 8 MP camera with a \(53.5^\circ \) HFOV. Various pictures were collected in the mountains with clearly identifiable targets, e.g., church towers or transmission towers, which were aligned to the center of the image with the help of an overlaid grid. The EXIF data contain the position, so the azimuth of the recognizable target with respect to the viewpoint, i.e., \({\hat{\varphi }}\), could be referenced manually for the 10 sample images. The low sample size is due to the difficult task of manually orientating the test points and the lack of a publicly available image dataset with georeferenced objects.

Figure 7 and Table 2 present examples and the experimental results of the field tests. Only good or perfect skylines were accepted for this test, and the correlation was almost \(95\%\) on average. The mean of the absolute differences between \({\hat{\varphi }}\) and \(\varphi \) was \(1.04^\circ \), which is promising and could be improved with a higher-resolution DEM. As mentioned in Sect. 1, the error of the DMC can be \(10^\circ \)–\(30^\circ \). Measuring the inaccuracy of the compass sensor was beyond the scope of this study; nevertheless, this problem was experienced during the field tests. The benefit of the proposed algorithm is the more accurate orientation obtained from the camera picture and a DEM instead of the unreliable DMC. The purpose of the field tests was to demonstrate the precision that can be achieved with this method. In the tests, the main reasons for the average \(1.04^\circ \) error were the coarse resolution of the DEMs and the vegetation, as can be seen in the examples of Fig. 7a–d. Since cross-correlation proved to be less sensitive to this kind of inaccuracy, it is applied in the matching phase.

Fig. 7 Example test images for the reference measurements in the field with the extracted skyline (white), the panoramic skyline (orange), and the reference object (yellow cross) that was aligned to the center of the photo. The main difference between the two skylines is due to the coarse DEM and vegetation

Table 2 Result of the field tests

5 Conclusions and Outlook

This study proposed an automatic, computer vision-based method for improving the azimuth measured by the unreliable DMC sensor in mountainous terrain. The aim was to develop an algorithm for an outdoor AR app that overlays useful information about the environment from a Geographic Information System (GIS), e.g., peak names, heights, and distances. The main contribution of this work is the robust skyline extraction procedure based on connected components labeling. The skyline was extracted successfully in more than \(89\%\) of the sample set, which contains various mountain pictures. Furthermore, field tests were also carried out to verify the skyline matching. The deviation between the azimuth angle provided by the algorithm and the ground-truth azimuth was examined, and an average accuracy of \(1.04^\circ \) was reached. Performance issues were beyond the scope of this study; nevertheless, the algorithm is time- and storage-efficient, the results are promising, and they show that the proposed method can be applied as an autonomous, highly accurate orientation module in a real-time AR application that is under development. With suitable data and some adaptation, the system could also be used for visual localization in GNSS-challenged urban environments.