1 Introduction

Convolutional neural networks (CNNs) have gained popularity in recent years thanks to the availability of powerful GPUs that enable efficient training of accurate classification models [19]. Visual deep learning enables applications like 3D shape classification [32], multi-label image classification [6], flyer classification [39], face detection [7], and automatic car counting [35]. For building practical applications, the deep-learning community shares a common interest in reducing the development cycle, while increasing model accuracy and keeping infrastructure and power consumption expenditure under control. Many publications address these conflicting goals [10, 17, 18]. Most machine-learning approaches require a human in the loop responsible for crucial decisions such as defining the network, finding good combinations of hyper-parameters, and performing adequate preprocessing on the input data. To overcome the problem of manual selection, various automated approaches such as grid search, random search [3], Bayesian optimization [46], or hyperband optimization [30] have been proposed. These methods operate autonomously and improve model performance; however, they still suffer from two limiting factors. First, they require a definition of the search space. Second, they consume a large amount of resources for a single optimization task.

In this paper, we propose automated methods for quantifying the difficulty of a classification problem in terms of how hard it is to reach high accuracy on a given dataset. Our developed neural network architectures, the probe nets, reliably forecast how well any machine learning model will perform on the given dataset. We designed probe nets to be small to allow fast training, and still they provide a scoring value that consistently outperforms the three alternative scoring pipelines. The proposed probe nets can be used in combination with architecture search optimizers to efficiently drive the search toward promising configurations, avoiding the exploration of unsuitable networks. Consciously or not, the characterization of dataset difficulty is a process followed by every deep learning architect. When looking for a well-performing model for a new dataset, common practice is to try state-of-the-art networks to evaluate how hard it is to classify the images in the dataset. Since datasets are large and models complex, the process of training, comparing, and selecting a few state-of-the-art deep networks becomes a computationally heavy task. Probe nets improve this step by providing a classification difficulty estimator, which provides insights into the classification task and can be used to rapidly confine the exploration to a few promising networks. Our probe nets characterize datasets orders of magnitude faster than the actual training and have a high correlation with state-of-the-art network accuracies.

In summary, our main contributions are the following:

  • We conduct a large literature and experimental study on reference machine learning algorithms, including 484 cited state-of-the-art results.

  • We reproduce results of thirty deep learning models over sixteen datasets to support the key observation that no single algorithm outperforms the alternatives. We establish the reference dataset classification difficulty on the ensemble of reference results.

  • We propose and evaluate four dataset complexity scoring pipelines to estimate the classification difficulty.

  • We develop probe nets as suitable candidates to efficiently and reliably estimate the classification difficulty.

  • We evaluate approximate computing techniques, such as subsampling and early stopping, to further reduce the execution time without affecting the final results.

  • We showcase the proposed dataset characterization used in an architecture search setting.

The remainder of the paper is organized as follows: Section 2 describes the related work, and Sect. 3 introduces the notation and presents an overview of reference models and their performance on the datasets. Section 4 details the adopted methodologies. Section 5 examines the results. Section 6 studies efficient implementations, Sect. 7 demonstrates how the methodologies are applied to perform an efficient architecture search, and Sect. 8 concludes.

2 Related work

The topic of difficulty estimation of a dataset is scarcely explored in the literature. In [51], the authors address the difficulty of visual search within one image by assessing the human response time for solving a visual search task. Their technique employs two VGG-like [45] networks that work as encoders and extract features that are further passed through a regressor to produce a per-image score. In contrast, our technique focuses on defining how easily separable the different classes within a dataset are; hence, our proposed score measures how challenging a classification instance is. We omit a direct comparison since their reference consists of measured human response times with high variances, which causes their approach to produce modest correlations. In contrast, we motivate and validate our score on an extensive set of reference results, which yields stable results and strong correlations for our best proposed approach.

In [22], the authors propose measures that characterize the difficulty of a classification problem, focusing on the geometrical complexity of the class boundary. However, those measures are defined and evaluated over binary classification problems on a low-dimensional feature vector space. In contrast, we focus on image classification problems that build a high-dimensional, spatially correlated data space consisting of multiple classes.

The H-divergence is a rigorous measure computed between two datasets in the field of domain adaptation [2, 15, 28], where a source and a target distribution are compared. In contrast, our approach scores the classification complexity stemming from a single dataset. Scoring datasets independently of additional information allows efficiently extending a collection of datasets and their scores without computing cross-interactions between multiple datasets.

The latest research on neural network design and network architecture search is accounted for by considering the suggested architectures in this work. We reproduce all results for a fair comparison. This coverage ranges from residual bypass operations (ResNets [20]) to high fan-out and convergent structures such as those occurring in the inception module [48] or DenseNets [24]. Structures that favor lighter operations for better performance on constrained devices, such as MobileNets [23], and structures generated by architecture search, such as PNASNets [31], are included in our study.

The main concepts and key ideas of characterizing datasets are available in a preprint [44]. This work summarizes those findings and extends them in three ways: first, the established reference dataset characterization is enlarged from one reference architecture to an ensemble of state-of-the-art architectures; second, a novel and efficient Fréchet inception distance-based pipeline is added; and third, a use case of architecture search demonstrates time-critical design choices and the usefulness of the dataset characterization in a real application.

3 Datasets and reference models

3.1 Notation

In this work, we refer to a dataset with the quadruple \(\mathcal {D}:= (\mathbf {X}_{train}, \mathbf {y}_{train}, \mathbf {X}_{test}, \mathbf {y}_{test})\), where \(\mathbf {X}_{train} \in \mathbb {R}^{n_{train} \times d}\) and \(\mathbf {X}_{test} \in \mathbb {R}^{n_{test} \times d}\) are the training and testing inputs, and \(\mathbf {y}_{train} \in [1,C]^{n_{train}}\) and \(\mathbf {y}_{test} \in [1,C]^{n_{test}}\) are the training and testing labels. We assume that the datasets come already split into train and test sets, as this is commonly the case for published data. We denote the dimension of the input samples as d, the number of input samples as \(n_{train}\) for training and \(n_{test}\) for testing, and the number of classes as C. \(\mathcal {M}\) refers to a model, including the network topology, its topology-related hyper-parameters, and the training and data augmentation-related hyper-parameters. Therefore, the tuple \((\mathcal {D}, \mathcal {M})\) specifies a deep learning training run of model \(\mathcal {M}\) on dataset \(\mathcal {D}\). We denote with \(\text {Top-1}(\mathcal {D}, \mathcal {M})\) the Top-1 classification accuracy of the training run. In all experiments, training is performed with \((\mathbf {X}_{train}, \mathbf {y}_{train})\) and performance is measured on \((\mathbf {X}_{test}, \mathbf {y}_{test})\).
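
To make the notation concrete, the following minimal Python sketch mirrors the quadruple \(\mathcal {D}\) and the Top-1 metric; the class `Dataset`, the function `top1_accuracy`, and the assumed `model.predict` interface are illustrative and not part of any released code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Dataset:
    X_train: np.ndarray  # shape (n_train, d)
    y_train: np.ndarray  # labels in [1, C], shape (n_train,)
    X_test: np.ndarray   # shape (n_test, d)
    y_test: np.ndarray   # labels in [1, C], shape (n_test,)

    @property
    def num_classes(self) -> int:
        return int(max(self.y_train.max(), self.y_test.max()))

def top1_accuracy(model, dataset: Dataset) -> float:
    # Top-1(D, M): fraction of test samples whose predicted label matches y_test
    predictions = model.predict(dataset.X_test)   # `predict` is an assumed interface
    return float(np.mean(predictions == dataset.y_test))
```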

3.2 Datasets and literature referenced results

Table 1 Dataset overview
Fig. 1

State-of-the-art results achieved on datasets with more than twenty reference results. Each point corresponds to a single published algorithm, capturing vanilla CNNs with or without transfer learning, non-deep learning approaches such as SVMs or random forests applied to handcrafted features, and other problem-specific pipelines. Highlighted in black is the best reference solution we reproduced in a controlled environment

Fig. 2

The number of samples within each dataset used for training and testing, sorted by the number of training samples. Train and test sets are always disjoint, and the split is given as suggested by the reference. The number of training samples spans more than two orders of magnitude

Table 2 Established reference network architectures

In this work, we focus on sixteen publicly available and established image classification datasets as presented in Table 1. Figure 2 depicts the number of samples used for training and testing. The datasets span two orders of magnitude in the number of classes and in the number of available training samples and one order of magnitude in the balance ratio. The datasets stem from various domains and cover typical and relevant use cases such as optical digit recognition from handwritten samples (MNIST) or from images of house numbers (svhn). GTSRB covers traffic sign recognition, a use case that occurs in autonomous driving systems. Scene recognition aims to classify the location where the picture was taken as a whole (indoor67 and places), whereas traditional classification tasks are posed around identifying a class based on a particular object present within the image. Figure 1 shows externally generated and collected state-of-the-art results. Various known machine learning solutions spread over a large performance range, except for simple datasets, such as MNIST [13], where most methods perform well. Highlighted is the best experimental value we reproduced in a controlled environment, as described in Sect. 3.3. These accuracy results are competitive with top performing published results and outperform published solutions in the case of fashion MNIST, since the referenced results stem from an extensive study of traditional machine learning approaches that does not include deep learning algorithms [54].

3.3 Reference models

In order to evaluate the reference difficulty of an image classification problem, we report the accuracy obtained by an ensemble of established deep learning network topologies. Since no single network model performs best on all available datasets, and reference accuracies are not published for all models on all considered datasets, we reproduced the reference accuracies within this work with an extensive set of experiments. We use a predefined list of well-established network architectures and report the achieved accuracies on the considered datasets. Table 2 summarizes the network models we consider as established reference models. Most of them are provided with family-specific and topology-related hyper-parameters; for example, for VGG or ResNets the parameter refers to the total number of layers and controls the overall complexity of the model. All models are well established, and the only modification we performed was to adapt the weight matrix of the last output layer such that the number of output neurons matches the number of classes of the given dataset. Figure 3 shows averaged normalized execution times for one batch of size 128 for training and testing. All experiments in this paper are obtained with PyTorch version 0.4.1 and run on an IBM Power8 equipped with a P100 GPU. Timings are measured in a realistic setting, i.e., including the overheads of loading and transferring data between CPU and GPU and kernel launch overheads. Training times of one batch include operations caused by back propagation and the weight update, while testing times refer to the elapsed time the model requires to perform a batched forward inference. The fastest model considered is LeNet, which trains within \(30\,\mathrm {ms}\) per batch and is \(16\times \) faster than ResNeXt29_4x64d, which takes about \(480\,\mathrm {ms}\) per training batch.

Fig. 3

Average timings per batch of size 128 for training and testing of reference models. Timings span from \(30\,\mathrm {ms}\) up to \(480\,\mathrm {ms}\) per training batch

The next section empirically demonstrates two insights obtained from the ensemble of reference models: First, there does not exist a single model that outperforms all other models on every dataset. Second, for a given dataset, there are trends in how the ensemble of models behaves. The first observation is a key argument for performing architecture search on new problem instances to find suited models for new datasets. The second demonstrates the inherent difficulty of the problem instance, i.e., the classification difficulty present in the data independent of the specific model fitted on it.

3.4 Reference results as proxy for classification difficulty

Figure 4 shows all reference accuracies obtained when training the established models on the considered datasets as described in Sect. 3. The results of the models cluster at a dataset-specific accuracy saturation. That saturation limit demonstrates the existence of a data-inherent classification difficulty for machine learning algorithms independent of the model choice. We empirically demonstrate that a single best model does not exist by highlighting some of the results presented in Fig. 4. Table 3 reports which reference model achieved the best accuracy and hence represents the accuracy saturation limit of the current state of the art. Over the 16 problem instances imposed by the 16 datasets, twelve different reference models are the best performer on at least one dataset, and some of them are selected as the best candidate multiple times. However, none of the reference architectures is able to clearly outperform the others. In some cases, such as for MNIST, all reference accuracies obtained by the various models are close, which makes it challenging or impossible to distinguish two different models or to select the best model. To address this fact, Table 4 demonstrates the importance of each of the reference models by comparing their minimal, maximal, and average percent point loss against the best performing model of the ensemble per dataset. Some models perform well across all datasets, such as DenseNet161, which only performs 1.12 percent points worse on average than the best performer of the ensemble. However, there exists a dataset where the drop against the best model is still as large as 5.73 percent points.

Despite the above discussed variations of results among the models, there exists an underlying dataset characteristic that saturates the achievable performance of any machine learning-based model. Based on this observation, we can define the theoretical dataset classification difficulty as the saturating accuracy obtained with the best possible model. However, since at the time of writing we only conducted experiments on a finite list of reference models, we use a more stable formulation: the mean accuracy obtained over the best \(k=5\) models in the reference experiments of the architectures listed in Table 2. The last column of Table 3 lists the resulting reference dataset classification difficulty number (DCN) that acts as a proxy for the real difficulty, which is not directly measurable.
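
The following short sketch illustrates how such a reference DCN can be computed from a list of reproduced reference accuracies; the function name and the accuracy values in the usage example are purely illustrative.

```python
import numpy as np

def dataset_classification_difficulty(reference_accuracies, k: int = 5) -> float:
    # Mean Top-1 accuracy of the k best reference models on one dataset.
    accuracies = np.sort(np.asarray(reference_accuracies, dtype=float))[::-1]
    return float(accuracies[:k].mean())

# Hypothetical Top-1 accuracies of the reference ensemble on one dataset:
dcn = dataset_classification_difficulty([0.91, 0.93, 0.89, 0.94, 0.92, 0.88, 0.90])
```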

4 Classification difficulty estimation of datasets

In this work, we propose four pipelines to quantify the difficulty of image classification datasets. In more detail, we propose dataset scoring functions \(r(\mathcal {D})\) to map a dataset \(\mathcal {D}\) to a scalar real number, with the goal of scoring different datasets in terms of classification accuracy estimates. For each pipeline, we highlight pros and cons.

Fig. 4

Reproduced reference results over an ensemble of established deep learning architectures. Results clearly cluster at dataset-specific levels, demonstrating the impact of the inherent classification difficulty imposed by the data of the problem instance. We highlight results obtained with two different architectures, demonstrating that neither of them outperforms the other in all cases

Table 3 Best reference architectures per dataset; no single architecture is able to outperform the remaining architectures
Table 4 Comparison of reference models

4.1 Silhouette score

The silhouette score is a well-established metric that compares tightness of same-class samples to separation of different-class samples [43]. Let i be one input sample, a(i) the average Euclidean distance between the sample and all the points j belonging to the same class as i, and b(i) the average distance between i and all points j of the closest different class. The silhouette of the i-th sample is computed as follows [43]:

$$\begin{aligned} s(i) \quad := \quad {\left\{ \begin{array}{ll} 1 - a(i)/b(i), &{}\text {if}\ a(i) < b(i) \\ 0, &{}\text {if}\ a(i) = b(i) \\ b(i)/a(i) - 1, &{}\text {if}\ a(i) > b(i). \end{array}\right. } \end{aligned}$$
(1)

The silhouette of one class is defined as the average over all samples belonging to that class, and the overall silhouette score of the full dataset is defined as the average over all samples. The definition of the quantities a(i) and b(i) is based on pairwise distances between two samples i and j. The silhouette score complexity is \(O(\bar{d}n^2)\), where n is the number of samples and \(\bar{d}\) is the cost of computing the distance of one pair of samples, e.g., as mean squared error (MSE) distance in \(\mathbb {R}^{\bar{d}}\). Since the MSE distance in the original domain is a poor measurement of image similarity, we first apply a transformation \(\mathbb {R}^d \rightarrow \mathbb {R}^{\bar{d}}\) that maps images into a space that better reflects distances between image pairs.

Table 5 Configurations used to compute the silhouette score on datasets

Table 5 provides details on the applied pipelines. We considered resizing the images to a small resolution of \(8\times 8\) pixels, applying principal component analysis (PCA) to reduce the dimension to 10, and using a fixed encoding based on a pre-trained CNN inference. As encoder, we used a ResNet-50 [20] network pre-trained on ImageNet [12] to produce generalized per-image feature vectors of dimensionality 1000 by taking the output of the last fully connected layer before applying the nonlinearity. In addition to the MSE distance, we used the structural dissimilarity index DSSIM [53] to compare images with a metric that captures spatial information. Due to the squared complexity, we applied heavy subsampling and ran all computations with a maximum of 1000 randomly selected samples, resulting in a distance matrix with at most 1M entries. Table 5 reports timings for the different pipelines. For fast execution, it is crucial to operate in a low-dimensional space and to use a simple distance metric.
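
As an illustration of the CNN-encoding variant of this pipeline, the sketch below subsamples the dataset, embeds the images with an ImageNet-pretrained ResNet-50, and computes the silhouette score on the embeddings; the use of scikit-learn's `silhouette_score` and the function name are assumptions, not the authors' implementation.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.metrics import silhouette_score

def silhouette_difficulty(images: torch.Tensor, labels: np.ndarray,
                          max_samples: int = 1000) -> float:
    # images: float tensor (n, 3, 224, 224), resized and ImageNet-normalized
    idx = np.random.choice(len(labels), size=min(max_samples, len(labels)),
                           replace=False)
    encoder = models.resnet50(pretrained=True).eval()
    with torch.no_grad():
        # 1000-dim output of the last fully connected layer, no nonlinearity
        embeddings = encoder(images[torch.as_tensor(idx)]).cpu().numpy()
    # A higher silhouette means better separated classes, i.e., an easier dataset.
    return float(silhouette_score(embeddings, labels[idx], metric="euclidean"))
```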

4.2 K-means clustering

The complexity of the silhouette score detailed in Sect. 4.1 scales with \(n^2\), and computing it is a slow process even after subsampling. In general, the complexity of a deep-learning job is \(O(C_{\mathcal {M}}n_{train}E)\), where E is the number of epochs. During one epoch, the full training set consisting of \(n_{train}\) samples is fed once, with a computational cost of \(C_\mathcal {M}\), which is a model-dependent constant. Even though complex models might have a large computational cost \(C_\mathcal {M}\), the asymptotic behavior of a training job is linear in n. For this reason, the \(n^2\) asymptotic cost of the silhouette score computation eventually exceeds that of the actual training job. Competitive scoring metrics should execute faster than the training job itself; thus, we are looking for scores with at most linear complexity in n.

We propose to run a (fast) clustering algorithm to produce class labels \(\tilde{\mathbf {y}}\) and to evaluate the full dataset based on metrics that compare \(\tilde{\mathbf {y}}\) against the ground truth labels \(\mathbf {y}\). We assess the following known scores: adjusted mutual information [52], adjusted rand index [25], completeness, homogeneity, and the v-measure [42]. Additionally, we propose a tailored score based on the confusion matrix built between the cluster indices and the true labels. This score is computed as the accuracy over the permutation of labeling assignments of the unsupervised cluster indices that maximizes the trace of the confusion matrix, as sketched below.
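
A possible realization of this confusion-matrix-based score (AECM) is sketched below; finding the trace-maximizing permutation with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) is an assumption, as the text only requires the best permutation, and the zero-indexed labels and function names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

def aecm_score(X: np.ndarray, y: np.ndarray, num_classes: int) -> float:
    # X: (n, d) flattened (and optionally PCA-reduced) image features
    # y: true labels assumed to be zero-indexed (0..C-1)
    centroids = np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])
    clusters = KMeans(n_clusters=num_classes, init=centroids, n_init=1,
                      tol=1e-4, max_iter=300).fit_predict(X)
    cm = confusion_matrix(y, clusters, labels=np.arange(num_classes))
    # Hungarian matching: relabel clusters so that the diagonal sum is maximal.
    rows, cols = linear_sum_assignment(-cm)
    return float(cm[rows, cols].sum()) / len(y)
```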

4.3 Fréchet inception distance based score

The Fréchet inception distance (FID) is widely used as a measure to compare the quality of generative adversarial networks (GANs) [21, 33], where a comparison of synthetic and real distributions is required to measure the performance of GANs. To that end, the input image samples are fed through an inception network [49] such that each sample is embedded in a learned feature vector space. The resulting embedded vectors of one class are assumed to follow a multivariate Gaussian distribution. The Fréchet distance [14] of two Gaussian distributions A and B is defined as follows:

$$\begin{aligned} {\Vert \mu _A - \mu _B\Vert }_2^2 + \mathrm {Tr}(\Sigma _A + \Sigma _B - 2(\Sigma _A\Sigma _B)^{1/2}) \end{aligned}$$
(2)

where (\(\mu _A\), \(\Sigma _A\)) and (\(\mu _B\), \(\Sigma _B\)) are the mean vectors and covariance matrices of the two distributions A and B, respectively. Since the FID is defined between two distributions only, a full image classification problem with C classes is characterized by the pairwise FID between classes i and j for \(1 \le i,j \le C\). To summarize and normalize that dataset difficulty as a scalar value in a uniform way, we propose the following FID-based score, similar to the definition of the silhouette score:

$$\begin{aligned} f := {\left\{ \begin{array}{ll} 1 - \bar{FID}_{i,i}/FID_{i,j*}, &{}\text {if}\ \bar{FID}_{i,i} < FID_{i,j*} \\ 0, &{}\text {if}\ \bar{FID}_{i,i} = FID_{i,j*} \\ FID_{i,j*}/\bar{FID}_{i,i} - 1, &{}\text {if}\ \bar{FID}_{i,i} > FID_{i,j*}, \end{array}\right. } \end{aligned}$$
(3)

where \(FID_{i,j*}\) denotes the critical distance to the closest neighboring class, defined as \(FID_{i,j*} := \min _{j\in [1,C]\setminus \{i\}} FID_{i,j}\), and \(\bar{FID}_{i,i}\) operates as normalization coefficient. Since the Fréchet distance between two identically distributed Gaussians evaluates to zero, \(FID_{i,i} \equiv 0\), we use \(\bar{FID}_{i,i} = FID_{i',i''}\), where the FID is evaluated over two statistical measurements (\(\mu _{i'}\), \(\Sigma _{i'}\)) and (\(\mu _{i''}\), \(\Sigma _{i''}\)) obtained by computing first- and second-order moments over two disjoint sets of samples belonging to the same class i. Using this nonzero numerical estimate as normalization coefficient turns out to be very beneficial for reducing the obtained Fréchet distances into a normalized score.
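
The sketch below shows how the pairwise Fréchet distance of (2) and the within-class normalization term \(\bar{FID}_{i,i}\) could be computed from pre-computed inception embeddings; the use of `scipy.linalg.sqrtm` for the matrix square root and all names are assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    # feat_*: (n_samples, d) embedded feature matrices of two classes
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a, cov_b = np.cov(feat_a, rowvar=False), np.cov(feat_b, rowvar=False)
    cov_sqrt = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_sqrt):      # discard tiny imaginary numerical noise
        cov_sqrt = cov_sqrt.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))

def within_class_fid(feat_i: np.ndarray) -> float:
    # \bar{FID}_{i,i}: FID between two disjoint halves of the same class i
    half = len(feat_i) // 2
    return frechet_distance(feat_i[:half], feat_i[half:])
```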

The time complexity of the FID-based score is \(O(n_{s} C_{Emb} + C^2 C_{FID})\), where the first term depends linearly on the number of input samples, for each of which one inference of the inception network and the computation of the first- and second-order moments occur. The second term has a quadratic dependency on the number of classes, since (2) has to be evaluated pairwise for each pair of classes i and j. Evaluating (2) costs \(C_{FID}\), which is determined by the invoked shapes, i.e., the d-dimensional mean vectors and \(d \times d\) covariance matrices, and is dominated by the numerical routine that computes the matrix square root of the product of \(\Sigma _A\) and \(\Sigma _B\) via the blocked Schur algorithm [11]. Even though the matrix square root is optimized with a multi-threaded implementation and runs within tens of seconds for typical problem sizes, the quadratic number of pairwise distances required for the final score makes the approach slow; for example, for \(C=100\) classes and \(C_{FID} = 10\,\mathrm{s}\), the total computing time exceeds 24 h.

4.4 Probe nets

We propose probe nets, which are small and efficient neural networks. We demonstrate that training a probe net and using its accuracy as a dataset difficulty score outperforms the alternative approaches in compute time and difficulty estimation performance. We designed the architecture of the probe nets to have a complexity \(C_{\mathcal {M}_{probe}}\) that is general enough to be applied to any image classification task but considerably smaller than that of training a regular deep learning model \(\mathcal {M}\): \(C_{\mathcal {M}_{probe}} \ll C_\mathcal {M}\). Table 6 reports the operation count and the number of trainable parameters of the proposed probe nets. Additionally, to further speed up the execution time, we demonstrate and discuss in Sect. 6 how stopping the training of probe nets after a few epochs reduces execution time while still achieving good results.

Table 6 Operation count and number of parameters of proposed probe nets
Fig. 5

Probe nets: simple deep learning architectures used to characterize datasets. Static networks are shown in a–c; they only differ in the dense layer connections to the softmax output that ends with a dataset-specific number of classes C. a shows the regular, narrow, and wide probe nets that differ in kernel depth, b shows a shallow version, and c a deep version of the net for non-normalized and normalized kernel depths. Dynamic networks, d–f, scale the topology with respect to the number of classes C. d consists of dense layers that scale the number of hidden units according to a linear weighted sum between input and output dimension, e scales kernel depths according to C, and f scales the number of repetitions of stacked static layers according to C

We propose to construct variations of two types of probe nets: static probe nets that have a fixed topology and dynamic probe nets that scale the topology according to the number of classes. The regular probe net consists of three convolutional layers, each followed by batch normalization, max pooling of size \(2\times 2\), and ReLU activations, which are defined element-wise as \(x \mapsto \max (0, x)\). We use eight kernels in the first layer and double the number of kernels per layer. We provide wide and narrow variations that scale the number of kernels per layer up and down by \(4\times \), respectively. Shallow and deep variations are obtained by removing and adding two layers, respectively. Since the number of kernels doubles per layer, this leads to different tensor shapes between the last convolution and the C-way softmax, so the non-normalized shallow and deep probe nets have a considerably different number of trainable parameters. We define normalized probe networks to match the number of trainable parameters of the output layer of the regular probe net. We construct dynamic nets with a more complex topology to account for more classes. This is achieved by scaling, dependent on C, either the number of hidden units in a multilayer perceptron (mlp), the number of filters (filter depth scaled probe nets), or the number of stacked filters (length scaled probe net). Figure 5 shows the ten proposed probe net architectures; a sketch of the regular probe net follows below. We evaluate all ten probe nets as reference, but for resource and time efficiency our proposed approach suggests using the best one. The next section presents the consistent and superior performance of probe nets over the reference scoring approaches.
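
The following PyTorch sketch illustrates the regular probe net described above; the \(3\times 3\) kernel size, the padding, and the exact ordering inside each block are assumptions, as the text only fixes the number of layers, the kernel counts, and the \(32\times 32\) input resolution.

```python
import torch
import torch.nn as nn

class RegularProbeNet(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 3):
        super().__init__()
        layers, channels = [], in_channels
        for out_channels in (8, 16, 32):            # kernels double per layer
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.MaxPool2d(2),
                nn.ReLU(inplace=True),
            ]
            channels = out_channels
        self.features = nn.Sequential(*layers)
        # For 32x32 inputs, three 2x2 poolings leave a 4x4 spatial map.
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))
```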

5 Results

In order to perform a fair evaluation, we fix hyper-parameters throughout the experiments and work with a fixed image resolution of \(32\times 32\) pixels. We evaluate the four proposed dataset difficulty scoring approaches against the reference dataset characterization given in Table 3. An ideal dataset difficulty score should obey a linear dependency and match the reference DCN. The next subsection covers the three alternative scoring pipelines, and Sect. 5.2 shows the results achieved with our probe nets. All results are presented as correlations between the proposed score and the reference DCN as listed in Table 3. For each pipeline, we discuss in the following how correlations are affected by tuning the configurations of the respective pipeline.
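
For reference, the reported \(R^2\) values can be obtained as the squared correlation coefficient of a linear fit between score and reference DCN, e.g., with a helper like the following (an illustrative sketch, not the authors' evaluation code).

```python
import numpy as np
from scipy.stats import linregress

def r_squared(scores, reference_dcn) -> float:
    # Squared Pearson correlation between a difficulty score and the reference DCN.
    result = linregress(np.asarray(scores, dtype=float),
                        np.asarray(reference_dcn, dtype=float))
    return float(result.rvalue ** 2)
```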

5.1 Silhouette, clustering and Fréchet-based scores

Fig. 6

Prediction performance of silhouette, clustering and Fréchet-based scores. The x-axis reports the obtained score and the y-axis reports the reference classification difficulty as in Table 3

Figure 6 shows the obtained results for the silhouette (Sect. 4.1), k-means (Sect. 4.2), and Fréchet-based scores (Sect. 4.3) correlated against the reference DCN. The three approaches are shown with the versions that produce the best results. The silhouette score yields correlations that range from \(R^2 = 0.04\) to \(R^2 = 0.21\) for the configurations defined in Table 5. Best results are obtained by using the structural dissimilarity index DSSIM [53] directly applied in the original domain. Alternatively, using an embedding based on a neural network followed by a mean squared error (MSE) distance metric works equally well, while using PCA to reduce the dimension or directly applying an MSE in the original domain produces weaker correlations.

For the evaluation of the proposed k-means-based scoring pipeline (see Sect. 4.2), we cluster the images into C clusters, where C is the known number of categories in the dataset. For faster convergence, we initialize the centroids with the average image of each class. k-means runs with the Euclidean \(L_2\) distance, a stopping tolerance of \(10^{-4}\), and a maximum of 300 iterations without random restarts. We tested four preprocessing options, none, resizing to a smaller image, applying PCA, or using an auto encoder prior to clustering, in conjunction with the following aggregation metrics: accuracy on the estimated confusion matrix (AECM), the adjusted mutual information score, the adjusted rand score, the v-measure, the homogeneity score, and the completeness score. Correlation results range between \(R^2 = 0.01\) and \(R^2 = 0.26\), where best results are obtained when applying PCA and using AECM. The weak performance of k-means clustering is due to known limitations, such as the lack of a global minimum guarantee and a simplistic distance metric that ignores spatial information. k-means clustering-based pipelines are \(5.2\times \) (no preprocessing) up to \(50.5\times \) (PCA to low dimension) faster than silhouette score-based pipelines (the comparison includes the faster MSE timings) when comparing execution times in terms of average compute time per input sample.

The rightmost plot in Fig. 6 shows the Fréchet-based scoring performance. In contrast to the two previous approaches, which are weakly correlated with the true dataset difficulty, the FID-based score is strongly correlated with it. The four evaluated variants, using an embedding dimensionality of \(d=\) 64, 192, 768, or 2048, produce correlations that range from \(R^2 = 0.71\) to \(R^2 = 0.75\). The best Fréchet-based score is achieved with \(d=192\).

Fig. 7

Prediction performance of probe nets: the x-axis reports the accuracy reached by a converged probe net and the y-axis reports the reference DCN. All seven static probe nets (a regular/narrow/wide, b shallow/deep, c shallow/deep normalized) and the three dynamic probe nets (d mlp and kernel depth scaled, e length scaled) are strongly correlated with the reference DCN

5.2 Probe nets

Figure 7 shows the correlations of all ten proposed probe nets against the reference DCN. The probe nets, as presented in Fig. 5, are trained with the same constant configuration and data augmentation parameters as used to produce the reference results. We follow the data augmentation described in [29], and we use the RMSProp [50] optimizer to minimize the average cross-entropy with a learning rate of \(10^{-4}\). All evaluations employ the He initialization [19] with a gain factor of 1.0 and a constant batch size of 32. Training is run for 100 epochs; a sketch of this setup follows below. The probe nets exhibit a high correlation with the reference DCN that ranges between \(R^2 = 0.63\) and \(R^2 = 0.95\) and consistently outperform the results achieved with the silhouette-based and k-means-based approaches, see Sects. 4.1 and 4.2. Seven out of the ten networks exceed a high correlation of \(R^2 > 0.79\) and hence outperform the Fréchet-based approach as well, see Sect. 4.3.
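
A hedged sketch of this training setup is given below; the He-initialization helper, the device handling, and the evaluation loop are illustrative simplifications, and data augmentation is omitted.

```python
import torch
import torch.nn as nn

def he_init(module: nn.Module) -> None:
    # He initialization for convolutional and linear layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def train_probe_net(probe_net: nn.Module, train_loader, test_loader,
                    epochs: int = 100, device: str = "cuda") -> float:
    probe_net.apply(he_init)
    probe_net.to(device)
    optimizer = torch.optim.RMSprop(probe_net.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    probe_net.train()
    for _ in range(epochs):
        for images, labels in train_loader:   # loaders built with batch_size=32
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            criterion(probe_net(images), labels).backward()
            optimizer.step()
    # The resulting Top-1 test accuracy is used as the dataset difficulty score.
    probe_net.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            predictions = probe_net(images.to(device)).argmax(dim=1).cpu()
            correct += int((predictions == labels).sum())
            total += labels.numel()
    return correct / total
```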

Figure 7a shows an increasing correlation, from \(R^2 = 0.75\) to \(R^2 = 0.93\), between the narrow, regular, and wide probe nets and the reference DCN. This can be explained by the better generalization ability of the network with more degrees of freedom, at the cost of an increased execution time. Properties of the probe nets are provided in Table 6. Deep probe net topologies outperform their shallow counterparts. This effect is even more prominent in the normalized case, Fig. 7b versus c. We observe that better generalization performance is mainly driven by a larger number of tunable parameters, which comes at the cost of increased execution time. Figure 7d and e shows the results for probe nets that dynamically adapt the architecture topology to the number of classes. The dependency of the architecture on the number of classes implies different execution times on datasets with different numbers of classes.

5.3 Metric alternatives

Figures 6 and 7 present the correlations of the best performing configuration per pipeline against the reference DCN as presented in Table 3. Figure 8 justifies that choice and demonstrates the robustness against alternative choices. These alternatives include the average accuracy of the ensemble of all reference models as listed in Table 2, the accuracy of specific models, and the aggregated distribution of correlations computed against all reference models. As Fig. 8 shows, all options yield consistent correlations, validating our proposed reference DCN as a suitable proxy for the reference complexity of the classification task. Probe nets consistently provide higher correlations than the three alternative scores. These findings are robust against different configurations of the pipelines and the choice of proxy metric for the dataset complexity.

Fig. 8

The used proxy metric, the reference DCN, performs as well as alternative choices, which provide consistent results, such as using the average accuracy over all reference models, specific models, or the aggregation of all models. Probe nets outperform the three alternative scoring pipelines

6 Efficient evaluation of probe nets

Fig. 9

Evolution of the prediction quality over training epochs of the deep normalized probe net. The regression quality reaches high values within a few epochs, while the average accuracy difference between the probe net and the reference DCN decreases further with longer training

As presented in Sect. 5.2, probe nets have a good predictive behavior regarding what a reference network achieves on a given dataset. However, that information is more valuable if it can be computed faster than training large models. The way probe nets are constructed gives them an inherent computational benefit over the full models. In addition, we exploit early stopping of the learning to further reduce the computational time of the probe net. Note that we can stop the probe net before convergence, since we are interested in the learning trend that characterizes the problem's difficulty, not in the final accuracy. Figure 9 shows how the prediction quality improves with an increasing number of epochs when the deep normalized probe net is trained on all datasets. Even within the first epoch, the regression quality outperforms the FID-based approach. The mean accuracy difference between the probe nets and the reference DCN (trained till convergence) decreases further with more epochs, meaning that the probe nets are not yet converged and are still increasing their own classification performance.

Fig. 10

Normalized execution time of a deep normalized probe net. Scenario A uses standard settings, while in scenario B the average execution time is significantly reduced by training the probe net with a large batch size of 1024 samples and using 8 threads for the on-the-fly standard preprocessing, including padding and random cropping, random horizontal flips, and normalization. In scenario C, processing times are further reduced by minimizing the preprocessing to the required cast and transfer from the CPU to the GPU

We evaluate all probe nets with the same settings, using a batch size of 128 samples and 2 threads for on-the-fly preprocessing, as used for the reference network timings shown in Fig. 3. For the reference models, the average batch time was \(194 \,\mathrm {ms}\) and was compute bound by the operations performed on the GPU for most of the models. However, we observed that for small models, and this is especially relevant for the designed probe nets, the GPU might not be fully nor optimally utilized, since the relative overheads of batch preparation and loading have a higher impact compared to the lowered GPU workload of very small networks. In such settings, the overall timing performance is strongly dependent on implementation details and the underlying hardware. Since that is not the case for medium and large sized networks, we used a standardized setting of a multi-threaded data parallel loader that performs batch preparation on the CPU side. All measurements include batch preparation times consisting of padding, random cropping, random horizontal flipping, and normalization. Since CPU and GPU operations are performed in parallel, most of the batch preparation times are hidden behind the GPU computation or are small compared to the GPU workload such that they do not matter. However, for small networks, such as the probe nets, we found it beneficial to fine tune the settings. Running with larger batch sizes helps to reduce kernel overheads and leads to better utilization and execution performance. We tested batch sizes of powers of two \(2^i\), from \(2^0 = 1\) up to \(2^{13} = 8192\), and observed the best performance for a batch size of 1024. In the performance measurement we assumed a dataset size of 8192 samples, i.e., the number of batches measured is \(2^{13 - i}\), which is relevant for filling and flushing effects of the pipeline. For example, the corner case of batch size 8192 triggers only a single batch, which causes the GPU to stall and wait until the first batch preparation has fully finished. Figure 10 shows the normalized execution time for 128 samples with A) standard settings, B) optimized settings using a larger batch size of 1024 samples and 8 instead of 2 threads, and C) the optimized settings with a minimal batch preparation pipeline that only casts CPU float arrays into GPU allocated tensors; a configuration sketch follows below. The optimized settings B) allow running the deep normalized probe net within 10.2 ms per 128 samples, which is \(19.4\times \) faster than the average execution speed of the reference models. Training on the CIFAR10 dataset takes about four seconds per epoch and completes the 100 epochs within seven minutes.
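
The three scenarios can be expressed, for example, as the following data loading configurations (an illustrative sketch using torchvision and CIFAR10; the exact augmentation parameters and normalization statistics are assumptions).

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

standard_preprocessing = T.Compose([
    T.RandomCrop(32, padding=4),        # padding and random cropping
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    # commonly used CIFAR-10 channel statistics
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
minimal_preprocessing = T.ToTensor()    # cast only; augmentation skipped

def make_loader(scenario: str, root: str = "./data") -> DataLoader:
    if scenario == "A":                 # standard settings
        batch_size, workers, transform = 128, 2, standard_preprocessing
    elif scenario == "B":               # larger batches, more loader threads
        batch_size, workers, transform = 1024, 8, standard_preprocessing
    else:                               # "C": minimal batch preparation
        batch_size, workers, transform = 1024, 8, minimal_preprocessing
    dataset = CIFAR10(root, train=True, download=True, transform=transform)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=workers, pin_memory=True)
```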

Fig. 11

System overview of an efficient architecture search. First, the dataset is characterized; second, TAPAS uses a genetic algorithm that predicts the performance of candidate architectures to find a suitable architecture; and third, the selected architecture is trained on the given dataset

Table 7 Run time complexity of the main stages of the TAPAS-based pipeline as depicted in Fig. 11
Fig. 12

Reprinted with permission [26]. Predicted vs real performance (i.e., after training) of random architectures. Left plot: TAP trained without DCN or LDE pre-filtering. Middle plot: TAP trained with DCN, but LDE is not pre-filtered. Right plot: TAP trained only on LDE experiments with similar dataset difficulty

7 Application scenario: dataset characterization for fast architecture search

In this section, we demonstrate how the dataset characterization enables efficient architecture search. The broad field of automated architecture search is widely covered in the literature [1, 5, 34, 38, 41, 55, 57, 58]. Automated architecture search enables the discovery of new network models that might outperform reference models on a given dataset and hence addresses the problem, observed in Sect. 3, that arises when simply applying given reference networks. However, in a traditional approach a large number of candidate networks has to be fully trained on a given dataset, either requiring a vast amount of computing resources or taking a long time to complete. To that end, approaches that accelerate learning during architecture search have been proposed; early stopping based on learning curve predictors or transferring learned model weights to the next candidate model has been applied with success. Probe nets enable fast execution and good classification difficulty estimation that provide insights into datasets. Classification difficulty estimation enables selecting a problem-tailored model before investing resources and time to fully train all alternatives. Using the knowledge of dataset difficulty estimation is a key contribution to biasing and optimizing architecture searches toward models of fitted complexity. We recently proposed a method called train-less accuracy predictor for architecture search (TAPAS) that demonstrates how probe nets contribute to biasing and performing an architecture search. In this section, we present TAPAS, which is able to perform the complete architecture search without any training or retraining during the search [26] by tuning its search based on the DCN.

7.1 TAPAS system setup

Figure 11 shows the high-level system that is invoked for the train-less architecture search of TAPAS. In this section, we provide a high-level system description focusing on the use of the dataset characterization, whereas a detailed technical discussion of the internal operation of TAPAS is found in the literature [26]. The core of TAPAS performs a genetic evolution algorithm, similar to the one proposed in [41], however with the key differentiation that none of the candidate models are trained. Instead, an accuracy predictor (AP) has been trained beforehand to predict the accuracy given a network topology description and the dataset characterization. A prediction-driven genetic algorithm can be executed several orders of magnitude faster than the same algorithm that would require training each candidate network. In order to work properly, the accuracy predictor requires gaining and reusing knowledge about the dataset in order to accurately predict model performance for different datasets. To that end, the dataset characterization is used twice: offline to train the AP, and online as input to the AP to bias all predictions toward the data at hand. Given a new dataset instance, as shown in Fig. 11, the dataset characterization is computed as a first step, since it is required as input for TAPAS to work. After the TAPAS search yields a valid network, it is trained on the given data and returned to the user; the sketch below summarizes this flow. Table 7 summarizes the time complexity of each of the three steps. The TAPAS execution time depends neither on the size of the input dataset nor on the run time complexity of the models it predicts performance for. It solely depends on the number of evaluations, typically a constant given by the population size and the number of iterations of the evolution algorithm, and on the complexity of the encoded candidate model description. In all our experiments, TAPAS was able to execute in less than a few minutes. Both the DCN computation and the training time of the candidate model are proportional to the dataset size, the invoked model complexity, and the number of epochs the models are trained for. Since both contributions are of the same order and account for most of the time spent in the overall pipeline, the DCN computation becomes time critical. For the DCN stage, the probe net defines the model complexity and is designed to be lightweight and constant. In contrast, the complexity of the candidate model is itself a function of the resulting architecture found by TAPAS, causing model-based execution time variations among different runs. As discussed in Sect. 6, the DCN can be computed with a low number of epochs, resulting in superior execution time advantages over typical training settings used to train the candidate model.
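
The overall flow of Fig. 11 can be summarized by the following high-level sketch; `compute_dcn`, `tapas_search`, and `train` are placeholders for the components described above and in [26] and do not correspond to an actual TAPAS API.

```python
def architecture_search_pipeline(dataset, compute_dcn, tapas_search, train):
    # 1) Characterize the dataset: train a probe net and derive the DCN
    #    (cost proportional to the dataset size, but with a lightweight model).
    dcn = compute_dcn(dataset)
    # 2) Train-less genetic search: the accuracy predictor consumes the DCN and
    #    a candidate architecture encoding, so no candidate is trained here.
    best_architecture = tapas_search(dcn)
    # 3) Train only the selected architecture on the given dataset.
    return train(best_architecture, dataset)
```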

Table 8 Obtained accuracy with TAPAS-based pipeline as depicted in Fig. 11

7.2 Results

Figure 12 depicts the dependency of the TAP performance [26] on the dataset characterization when applied to completely unseen datasets. To that end, the TAP performance was measured on one unseen dataset while the TAP had been trained on the remaining datasets. The full experiment is cross-validated among all datasets. Without the DCN, the TAP is not able to predict the accuracy of candidate models on the given dataset well, resulting in a very loose correlation of \(R^2=0.28\), see the left plot of Fig. 12. However, if the DCN is fed to and used by the TAP, its performance increases to a high correlation. We observed that it is beneficial to use the DCN once more to pre-filter the available experiments, restricting the ground truth used to train the TAP to relevant experiments by selecting entries that were performed on datasets with similar dataset characterization values. The latter option, relying twice on the DCN, results in a high correlation between the predicted and the actual accuracy reached over the networks tested of about \(R^2=0.94\).

For completeness, Table 8 shows the accuracy obtained with the TAPAS framework. For each dataset, we repeated the pipeline \(n=10\) times, generating n different architectures per dataset. The average and standard deviation over the n repetitions are given in the first column. The second column reports the best accuracy reached, and the last column reports the best result reached when reusing the probe net rather than the generated architecture in cases where the probe net outperforms the solution found by the architecture search. Reusing the probe net is a zero-cost improvement, as it does not add computing time since the probe net is already trained within the regular high-level execution flow described in Fig. 11. This optimization turns out to be helpful on very small datasets, where TAPAS might find overly complex network architectures that then over-fit the dataset. The last column of Table 8 reports the result of this optimization.

An extended study of architecture search and improvements thereon is out of scope for this work. The results presented in Table 8 are all based on the TAPAS assumptions and implementation details [26]. TAPAS internally uses a block-based encoding to map network models to inputs processed by the AP. The setup of this encoding enforces a block-sequential structure on all considered networks. Even though TAPAS already considers a large search space, some of the reference models considered in Sect. 3 are outside the reachable search space of TAPAS. This limitation renders a direct comparison of the models from Sect. 3 with the current results unfair, and it is therefore omitted.

In this section, we demonstrated how the dataset characterization fine-tunes the core part of an accuracy predictor. Without any knowledge about the data, it would not be possible to generate data- and architecture-specific predictions. The TAPAS pipeline depicted in Fig. 11 is able to transfer knowledge about data and architecture to the use case-specific situation even when applied to completely unseen datasets. In contrast to traditional architecture searches that rely on fully or partially training the produced candidates, the TAPAS mechanism is based on prediction-based construction of suited architectures enabled through the dataset characterization. The training-free design enables running the TAPAS-based pipeline shown in Fig. 11 multiple orders of magnitude faster than traditional architecture search approaches.

8 Conclusion

We formulated the question of how to compute a score for datasets that reflects their inherent classification difficulty. We suggested four processing pipelines: a silhouette-based score, a k-means clustering-based score, a Fréchet inception distance-based score, and our probe net-based evaluation pipeline. The main drawback of the silhouette-based approach is its high complexity, which scales with the squared number of samples. We proposed efficient score computing pipelines based on k-means, the Fréchet inception distance (FID), and probe nets that scale linearly in the number of samples. k-means delivers results one complexity class faster and with slightly better prediction quality than the silhouette approach, reaching a weak correlation with the reference models of \(R^2=0.26\). The FID-based approach reaches a high correlation of \(R^2=0.76\), but with the drawback that its compute time is determined by solving a matrix square root problem \(C^2\) times.

Finally, we developed the probe nets, which are small networks, and applied standard deep learning techniques to compute predictions that are strongly correlated with the reference DCN, reaching correlations of up to \(R^2=0.95\). Even the worst performing probe net outperforms the silhouette- and k-means-based scoring by a wide quality margin. We further evaluated the effect of early stopping to reduce the data score evaluation time and observed little to no performance drop. Leveraging the small architectures of the probe nets and early stopping allows performing dataset scoring \(97\times \) faster than the required training time of the average reference model.