# PNAS

2018-ECCV-Progressive Neural Architecture Search

- Johns Hopkins University & Google AI & Stanford
- GitHub: 300+ stars
- Citation: 504

## Motivation

current techniques usually fall into one of two categories:

evolutionary algorithms(EA)orreinforcement learning(RL).

Although both EA and RL methods have been able to learn network structures that outperform manually designed architectures, they require

significant computational resources.

The two current nas methods, EA and RL, have computational costs

## Contribution

we describe a method that requiring

5 times fewer model evaluationsduring the architecture search.

Only one-fifth of the models need to be evaluated.

We propose to use

heuristic searchto search the space of cell structures,starting with simple (shallow) models and progressing to complex ones, pruning out unpromising structures as we go.

Incremental search, starting from the shallow network, gradually searches for complex networks.

Since this process is expensive,

we also learn a model or surrogate function(Substitution function)which canpredict the performance of a structure without needing to training it.

An evaluation function (predictor) that approximates the quality of a model is presented to directly predict model performance rather than training candidate networks from scratch.

Several advantages:

First, the simple structures

train faster, so we get some initial results totrain the surrogate quickly.

Proxy networks are small and fast (at a negligible cost).

Second, we only ask the surrogate to

predict the quality of structures that are slightly different(larger) from the ones it has seen

The predictor only needs to predict slightly different networks.

Third, we

factorize(decompose) the search space into a product(product) of smaller search spaces, allowing us to potentiallysearch models with many more blocks.

Decomposition a large search space into the product of a small search space.

we show that our approach is

5 times more efficient than the RL methodof [41] in terms of number of models evaluated, and8 times faster in terms of total compute.

The efficiency is five times higher than that of the RL method and the total computation is eight times faster.

## Method

### Search Space

we

first learn a cell structure, and thenstack this cell a desired numberof times, in order to create the final CNN.

Learn the cell structure first, then stack the cells to the target number of layers.

A cell receives a tensor of HxWxF, outputs a tensor of HxWxF if the cell's stride=1, and outputs a tensor of H/2 x W/2 x 2F if the cell's stride=2.

A cell consists of B blocks, with 2 input s and 1 output per block, and each block can be represented by a five-tuple \(left(I_{1}, I_{2}, O_{1}, O_{2}, C\right)\, the output of the CTH cell is represented as \(H^c\), and the output of the BTH block of the CTH cell is represented as \(H^c_b\).

Each block's input is a collection of the output of all blocks before {this block} and {the output of the last cell, the output of the last cell} in the current cell.

Operator has eight operations in its selection space.

we stack a

predefined numberof copies of thebasic cell(with thesame structure, butuntied weights Do not inherit weights), using eitherstride 1 or stride 2, as shown in Figure 1 (right).

After finding the best cell structure, stack the predefined number of layers to form the full network on the right, without inheriting the weights (retraining).

The number of stride-1 cells between stride-2 cells is then adjusted accordingly with up to N number of repeats.

The number of Normal cell s (stride=1) depends on N (superparameter).

we

only use one cell type(we donot distinguish between Normal and Reduction cells, but insteademulate a Reduction cell by using a Normal cell with stride 2),

We do not distinguish between normal cell and Reduction cell, only set the stride of Normal cell to 2 as Reduction cell.

### Progressive Neural Architecture Search

Many previous approaches directly search in the space of full cells, or worse, full CNNs.

The previous approach searched directly for the entire cell structure, or worse, the entire cnn.

While this is a more direct approach, we argue that it is difficult to directly navigate in an

exponentially large search space, especially at the beginning where there is no knowledge of what makes a good model.

Although this approach is straightforward, the search space is too large and at first we don't have any prior knowledge to guide us in which direction to search in the vast search space.

Search starts with one block per cell.Train all possible\(B_1\), with \(B_1\) Train the predictor, and then \(B_1\) Expand to \(B_2\).

Train all possible\(B_2\) is too expensive, so we use a predictor to evaluate all \(B_2\)-cell performance and select the best K\(B_2\- cell, repeat the process (with K selected\(B_2\)-cell training predictor, K\(B_will be selected2\) -cell expanded to \(B_3\) and select the best K with the predictor.

### Performance Prediction with Surrogate Model

Requirement of Predictor:

- Handle variable-sized inputs
- Correlated with true performance
- Sample efficiency (simple and efficient)
- The requirement that the predictor be able to handle variable-sized strings immediately suggests the use of an RNN.

Two Predictor method

RNN and MLP (Multilayer Sensor)

However, since the sample size is very small, we fit an

ensemble of 5 predictors, We observed empirically that thisreduced the varianceof the predictions.

Because the sample is simple, integrating five predictors (RNN-ensemble, MLP-ensemble) can reduce variance.

## Experiments

### Performance of the Surrogate Predictors

we train the predictor on the observed performance of cells with up to

bblocks, but we apply it to cells withb+1blocks.

Train on {B=b} and predict on the set of {B=b+1}.

We therefore consider predictive accuracy both for cells with sizes that have been seen before (but which have not been trained on), and for cells which are one block larger than the training data.

The prediction accuracy on the {B=b} untrained cell set and {B=b+1} cell set are also considered.

Randomly select 10k of all {B=b} cell collections as datasets\(U_{b, 1:R}\), training 20 epochs.

randomly select K = 256 models (each of size b) from \(U_{b,1 :R}\)to generate a training set \(S_{b,t,1:K}\);

256 training sets S were randomly selected from dataset U for each round.

A total of 20*256=5120 data points will be trained.

We now use this random dataset to evaluate the performance of the predictors using the pseudocode(Pseudocode) in Algorithm 2, where

A(H)returns the true validation set accuracies of the models in some setH.

A(H) returns the true accuracy of the cell's set H after training.

When B=b, the training set is a subset of all {B=b} cells. The first behavior is the correlation between predicted and real results on the training set (256*20=5120) of all {B=b} cells.

The second behavior is the correlation between predicted and true results on all {B=b+1} cell datasets (10k).

We see that the predictor

performs well on models from the training set, butnot so wellwhen predicting larger models. However,performance does increaseas the predictor is trained on more (and larger) cells.

The predictor performs well on the training set {B=b}, but does not perform well on the larger dataset {B=b+1}, but gets better as B increases.

We see that for predicting the

training set, the RNN does betterthan theMLP, but for predicting the performance on unseen larger models(which is the settingwe care about in practice), theMLP seems to do slightly better.

The predictor of the RNN method performs better on the training set {B=b} and MLP performs better on the larger dataset {B=b+1} (we are concerned about)

## Conclusion

The main contribution of this work is to show how we can

accelerate the searchfor good CNN structures by usingprogressive searchthrough the space of increasingly complex graphs

Accelerate NAS using progressive (cell-depth increasing) searches

combined with a

learned prediction functionto efficientlyidentify the most promising modelsto explore.

Use a learnable predictor to identify the potential optimal network.(A P-network is introduced to search for the optimal structure of the target network.eg. Use the C network to search for the best structure of the B network, and the B network is to search for the best structure of the A network, dolls)

The resulting models achieve the

same level of performanceas previous work but with afraction of the computational cost.

SOTA achieved at a small price