Learning with Neural Networks in High Dimensions
Misiakiewicz, Theodor. Stanford University. ProQuest Dissertations & Theses, 2023. Overseas doctoral dissertation (DDOD).
A central challenge of modern statistics concerns the curse of dimensionality, which refers to the difficulty of learning in high dimensions due to an exponential increase in degrees of freedom. Over the past half-century, researchers have developed various methods to overcome this curse, including clever feature engineering and prior assumptions on the data structure. In recent years, however, deep learning has emerged as a standard and powerful approach to exploiting massive high-dimensional datasets, vastly outperforming previous handcrafted and domain-specific methods. Despite the black-box nature of neural networks, they appear to avoid the curse of dimensionality by automatically learning relevant features and adapting to inherent structures in the data. However, understanding precisely the mechanism behind this capability remains an outstanding challenge. For example, it is unclear what relevant features neural networks can learn efficiently, how gradient-type algorithms construct these features dynamically, or how different resources (sample size, network width, and training time) should scale for solving different tasks. Historically, neural networks' approximation and statistical properties have been investigated separately from computational aspects. This gap leaves open the study of the subset of neural networks that are of practical interest, namely those efficiently constructed by gradient-based algorithms.

In this thesis, we consider three models, corresponding to two-layer neural networks trained in three different optimization regimes, and investigate their properties in high dimensions. In particular, our aim is to precisely characterize in each setting how the performance of neural networks depends on the number of samples, neurons, and gradient descent iterations used in their training.
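For context, the shared object across the three regimes is a two-layer network of width N. A minimal sketch in my own notation (not taken from the abstract), assuming a standard parameterization with activation σ:
\[
f_N(x; \Theta) \;=\; \sum_{i=1}^{N} a_i \, \sigma(\langle w_i, x \rangle),
\qquad \Theta = (a_i, w_i)_{i \le N},
\]
where the normalization of the second-layer weights a_i (for instance, order 1/\sqrt{N} for linearized training around a random initialization, order 1/N for the mean-field description) and which parameters are actually trained determine which of the three regimes below applies.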
• Neural networks in the linear regime: Neural networks trained in this regime effectively behave as linear methods. Specifically, we study random feature and kernel regression, which correspond respectively to finite-width and infinite-width models for linearized neural networks (both objectives are sketched at the end of this abstract). We derive sharp asymptotics for the test error in a new polynomial high-dimensional scaling, under certain abstract conditions on the underlying kernel. We show that these conditions are verified by some classical high-dimensional examples. These results allow us to precisely characterize the performance of neural networks in the linear regime, including the optimal width and the impact of network architecture. In addition, they explain phenomena such as multiple descents in the risk curve and benign overfitting.

• Neural networks in the mean-field regime: Training in this regime is non-linear and allows neural networks to perform feature learning, i.e., to construct features that are adapted to the data. As an instructive example, we consider the problem of learning multi-index functions on Boolean or isotropic Gaussian data. These functions depend only on a latent low-dimensional subspace and are not learnable efficiently by kernel methods. In contrast, we show that in the mean-field regime, training sequentially learns the function support with a saddle-to-saddle dynamic. The overall time complexity now depends on the target function's leap, which measures how "hierarchical" the function is, and not on the target function's smoothness as in the linear regime. In particular, this illustrates how non-linear training of neural networks can vastly outperform kernel methods by exploiting low-dimensional and hierarchical structures in the data to construct good features.

• Convex neural networks: We introduce a family of convex problems over the space of infinite-width two-layer neural networks. These can be seen as generalizations of kernel methods, whereby the RKHS norm (a weighted ℓ2 norm over the second-layer weights) is replaced by a weighted functional ℓp norm, 1 ≤ p ≤ 2 (also sketched at the end of this abstract). We first show that, for p > 1, these problems can be solved in polynomial time using a random feature approximation. This implies that convex neural networks are tractable methods that break the curse of dimensionality on an increasing set of functions as p decreases to 1. On the other hand, we argue that for p = 1, which achieves near-optimal statistical rates on multi-index functions, learning becomes computationally hard under some standard hardness assumptions.

The picture that emerges from this study is that neural networks can realize a rich range of trade-offs between statistical and computational complexity. In particular, neural networks trained in the linear, mean-field, and convex regimes, which can be seen as implementing three different statistical learning paradigms (fixed features, feature learning, and feature selection, respectively), suffer very differently from the curse of dimensionality. While linearized neural networks are simple computational models, they are statistically inefficient in high dimensions, except for very smooth target functions. On the other hand, convex neural networks can achieve near-optimal statistical rates on multi-index functions, but do not admit general tractable algorithms. Finally, non-linear training of neural networks can be seen as a tractable middle ground between these two extremes. In particular, it outperforms fixed feature methods by allowing efficient construction of relevant features for data with low-dimensional and favorable hierarchical structure.
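For concreteness, here is a minimal sketch of the main objects referenced in the bullets above, in my own notation and normalization (assumptions of this sketch, not taken from the thesis). Given data (x_j, y_j)_{j ≤ n} and i.i.d. random first-layer weights (w_i)_{i ≤ N}, random feature ridge regression trains only the second layer,
\[
\hat a \;=\; \arg\min_{a \in \mathbb{R}^N} \; \sum_{j=1}^{n} \Big( y_j - \tfrac{1}{\sqrt{N}} \sum_{i=1}^{N} a_i\, \sigma(\langle w_i, x_j \rangle) \Big)^2 + \lambda \|a\|_2^2,
\]
and kernel ridge regression is its infinite-width (N → ∞) counterpart with kernel K(x, x') = \mathbb{E}_w[\sigma(\langle w, x\rangle)\, \sigma(\langle w, x'\rangle)]. A multi-index target depends on the input only through a latent k-dimensional subspace, k ≪ d:
\[
f_*(x) \;=\; \varphi(\langle u_1, x \rangle, \ldots, \langle u_k, x \rangle).
\]
Finally, the convex neural network problems replace the RKHS (weighted ℓ2) penalty by a weighted functional ℓp penalty over the second-layer function a(·) of an infinite-width network,
\[
\min_{a} \; \sum_{j=1}^{n} \ell\Big( y_j,\, \int a(w)\, \sigma(\langle w, x_j \rangle)\, \mu(\mathrm{d}w) \Big) + \lambda \Big( \int |a(w)|^p\, \mu(\mathrm{d}w) \Big)^{1/p},
\qquad 1 \le p \le 2,
\]
with p = 2 recovering the kernel (RKHS) setting and p = 1 corresponding to the feature-selection end of the trade-off discussed above.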