
Generalization in Neural Networks

Updated: Sep 5, 2022

Are larger networks always better able to generalize? How important is regularization?


Sources:

Understanding Deep Learning Requires Rethinking Generalization

Sensitivity and Generalization in Neural Networks: an Empirical Study


Findings Summary:

  • A two-layer neural network with ReLU activations and 2n + d parameters can perfectly fit any dataset of n samples of dimension d (a construction in this spirit is sketched in code after this list).

  • Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

  • ReLU networks, with their unbounded activations, tend to be more robust (i.e. to generalize better) than networks with saturating HardSigmoid activations.
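
To make the first point concrete, here is a minimal NumPy sketch in the spirit of the interpolating construction that result refers to: project the n inputs onto one random direction (d parameters), give each sample its own ReLU breakpoint via a bias (n parameters), and solve a triangular system for the output weights (n parameters), for 2n + d parameters in total. The function names and the random-projection choice are mine, intended as an illustration rather than a reproduction of the paper's proof.

    import numpy as np

    def fit_two_layer_relu(X, y, seed=0):
        """Exactly interpolate n arbitrary targets y on n inputs X (shape n x d)
        with a two-layer ReLU network using 2n + d parameters."""
        n, d = X.shape
        rng = np.random.default_rng(seed)
        a = rng.normal(size=d)                  # d params: random input projection
        z = X @ a                               # 1-D projections (distinct with prob. 1)
        order = np.argsort(z)
        z_sorted, y_sorted = z[order], y[order]
        b = np.empty(n)                         # n params: one bias per sample
        b[0] = z_sorted[0] - 1.0
        b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0
        # Hidden unit j is active only for samples i >= j, so the activation
        # matrix is lower-triangular with a positive diagonal, hence invertible.
        H = np.maximum(0.0, z_sorted[:, None] - b[None, :])
        w = np.linalg.solve(H, y_sorted)        # n params: output weights
        return a, b, w

    def predict(a, b, w, X):
        z = X @ a
        return np.maximum(0.0, z[:, None] - b[None, :]) @ w

    # Example: 50 random inputs paired with 50 arbitrary real-valued labels.
    rng = np.random.default_rng(1)
    X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
    a, b, w = fit_two_layer_relu(X, y)
    assert np.allclose(predict(a, b, w, X), y, atol=1e-6)   # perfect fit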

Effects of Regularization on Generalization

Intuitively, if one were to take a large image dataset such as CIFAR-10 and randomize all the labels, so that there was no longer any "structured" relationship between the training instances and the expected outputs, one would expect that training a neural network to fit the data, however large the network, would be impossible.

Chiyuan Zhang et al. ran exactly this experiment back in 2017 and found that stochastic gradient descent was nonetheless able to optimize the weights to fit the random labels perfectly. Even more surprisingly, the networks they tested (Inception, AlexNet and an MLP 1x512) were still able to fit the data after the image pixels had been shuffled, or even replaced entirely with random noise drawn from a Gaussian distribution:


(*) Note that these randomization tests were run without any explicit regularization.
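
For intuition, here is a small, self-contained PyTorch sketch of the same kind of randomization test. It uses Gaussian-noise "images" with uniformly random labels instead of the CIFAR-10 pipeline, and a single hidden layer of 512 ReLU units roughly in the spirit of the MLP 1x512 baseline; the sizes, optimizer settings and step count are illustrative choices of mine, not the paper's.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in data: Gaussian-noise "images" paired with random labels,
    # i.e. no structured relationship at all between inputs and targets.
    n, d, num_classes = 2048, 32 * 32 * 3, 10
    X = torch.randn(n, d)
    y = torch.randint(0, num_classes, (n,))

    # One hidden layer of 512 ReLU units, no explicit regularization.
    model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, num_classes))
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(2000):                  # plain (full-batch) gradient descent
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
    print(f"training accuracy on pure noise: {train_acc:.3f}")
    # Given enough steps the accuracy keeps climbing toward 1.0: the network
    # simply memorizes the noise, since there is nothing else to learn.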


Before we even had large neural networks (that is to say, before the advent of "deep learning"), regularization was the standard way to prevent overfitting when a model had more parameters than data points.

"The basic idea is that although the original hypothesis is too large to generalize well, regularizers help confine learning to a subset of the hypothesis space with manageable complexity. By adding an explicit regularizer, for example by constraining the norm of the optimal solution, the effective Rademacher complexity of the possible solutions is dramatically reduced".

The Zhang et al. paper finds that in today's large, deep neural networks, such as the CNNs they tested, regularization seems to play a different role: all of the models still generalize fairly well to held-out test data even with every explicit regularizer turned off.
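
The kind of ablation this refers to can be sketched as follows: train the same architecture twice, once with common explicit regularizers (dropout and weight decay) and once with both switched off, then compare accuracy on the held-out test set. The helper names and hyperparameters below are hypothetical placeholders of mine, not the paper's exact setup.

    import torch
    import torch.nn as nn

    def make_model(p_dropout: float) -> nn.Module:
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(3072, 512), nn.ReLU(), nn.Dropout(p_dropout),
            nn.Linear(512, 10),
        )

    def train(model, loader, weight_decay, epochs=10):
        opt = torch.optim.SGD(model.parameters(), lr=0.01,
                              momentum=0.9, weight_decay=weight_decay)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model

    # Same data, same architecture; only the explicit regularizers differ.
    # (train_loader and the test-set evaluation are standard and omitted here.)
    regularized   = make_model(p_dropout=0.5)    # dropout on, weight decay on
    unregularized = make_model(p_dropout=0.0)    # no explicit regularization
    # train(regularized,   train_loader, weight_decay=5e-4)
    # train(unregularized, train_loader, weight_decay=0.0)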

