Maya Scolastica

What is Deep Learning?

Deep learning is powering the future, from self-driving cars to creative AI tools. But how does it work, and why is it so effective? This exploration delves into the mysteries of deep learning.

Deep learning, a subfield of machine learning, has become a driving force behind some of the most transformative technologies of our time. From self-driving cars and facial recognition to virtual assistants and creative tools like text-to-image generators, deep learning is changing the way we live, work, and interact with the world. But for many, even those with a basic understanding of machine learning, the inner workings of deep learning remain shrouded in mystery. 

This blog post, based on the book "Understanding Deep Learning" by Simon J.D. Prince, aims to explore the factors that contribute to deep learning's effectiveness and to highlight some of the ongoing research aimed at understanding its inner workings.

Supervised Learning: Learning from Labeled Examples

The journey begins with supervised learning, where we teach a machine learning model by providing it with labeled examples. Imagine a child learning to distinguish between cats and dogs. We show them pictures and say, "This is a cat," or "This is a dog." The child gradually learns to identify the key features that differentiate these animals.

Supervised learning works similarly. We provide the model with a dataset containing input-output pairs. For example, to predict house prices, the inputs might be features like square footage and number of bedrooms, while the outputs are the corresponding prices. The model, often a deep neural network, learns to map these inputs to outputs by adjusting its internal parameters to minimize the prediction errors.

This framework can be applied to various tasks, including:

  • Regression: Predicting continuous outputs, like house prices or stock market values.
  • Classification: Assigning inputs to discrete categories, like identifying handwritten digits or classifying emails as spam or not spam.
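
To make this concrete, here is a minimal sketch of the house-price regression described above, written in PyTorch (the post names no framework, and the data and network sizes here are invented for illustration):

```python
import torch
import torch.nn as nn

# Synthetic labeled data: inputs are (square footage, bedrooms), outputs are prices.
# Both the numbers and the network size are made up for illustration.
X = torch.tensor([[1400., 3.], [2000., 4.], [850., 2.], [1700., 3.]])
y = torch.tensor([[310_000.], [450_000.], [190_000.], [360_000.]])

X = (X - X.mean(dim=0)) / X.std(dim=0)  # normalize inputs
y = y / 100_000.                        # scale targets to a friendlier range

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()  # squared prediction error, as in regression

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # how far predictions are from the labels
    loss.backward()               # gradients of the loss w.r.t. the parameters
    optimizer.step()              # adjust internal parameters to reduce the error
```

For classification, the recipe is almost identical: the final layer would output one score per category, and the squared-error loss would be swapped for a cross-entropy loss.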

Shallow vs. Deep Neural Networks: The Power of Depth

Neural networks are the workhorses of deep learning. They are inspired by the human brain and consist of interconnected layers of "neurons" that process information. Shallow networks have only one hidden layer, while deep networks have multiple hidden layers. This depth allows them to learn hierarchical representations of the data, where each layer builds upon the previous one to extract increasingly abstract and meaningful features.

While shallow networks can theoretically approximate any continuous function, deep networks often require far fewer parameters to achieve the same level of accuracy. This efficiency, known as depth efficiency, allows deep networks to learn more complex and nuanced relationships in the data.
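
One rough way to see the depth-efficiency argument is simply to count parameters. The sketch below (with sizes chosen purely for illustration, not taken from the book) compares a single wide hidden layer against a stack of narrower layers with the same input and output dimensions:

```python
import torch.nn as nn

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# One wide hidden layer vs. several narrower ones, same input/output sizes.
shallow = nn.Sequential(nn.Linear(100, 4000), nn.ReLU(), nn.Linear(4000, 10))
deep = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

print(count_params(shallow))  # about 444,000 parameters
print(count_params(deep))     # about 160,000 parameters
```

The deeper stack uses far fewer parameters here, yet depth lets a network describe many more linear regions per parameter, which is the essence of the depth-efficiency claim.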

Beyond Simple Predictions: Structured Outputs and Generative Models

Many real-world problems involve predicting outputs with complex structures. For example, translating a sentence from English to French involves predicting a sequence of words that adheres to the grammatical rules of French. Similarly, generating an image from a text caption requires producing an image with realistic textures and spatial relationships between pixels.

Deep learning models can tackle these challenges by learning the structure of the output space. This can be achieved through structured output prediction, where the model explicitly learns the grammar or rules that govern the output. Alternatively, generative models learn to synthesize new data examples that are statistically indistinguishable from the training data. This allows them to produce outputs with complex structures by implicitly learning the underlying rules.

Unsupervised Learning: Discovering Hidden Patterns

The journey culminates with unsupervised learning, where the model learns from unlabeled data, discovering hidden patterns and structures without explicit guidance. This is particularly useful when labeled data is scarce or expensive to obtain.

Several types of unsupervised models exist, but the book focuses on generative models, which learn to synthesize new data examples that resemble the training data. Examples include:

  • Generative Adversarial Networks (GANs): These models pit a generator network, which creates new samples, against a discriminator network, which tries to distinguish real data from generated samples. This adversarial training pushes the generator to produce increasingly realistic outputs (a simplified training step is sketched after this list).
  • Normalizing Flows: These models transform a simple, known probability distribution into a complex one that matches the training data. This allows them to both generate new samples and evaluate the probability of new data points.
  • Variational Autoencoders (VAEs): These models learn a latent space, a lower-dimensional representation of the data, and use this latent space to generate new examples. While VAEs offer a solid probabilistic foundation, their sample quality can be lower than that of GANs or diffusion models.
  • Diffusion Models: These models gradually corrupt the data with noise and then learn to reverse this process, starting from pure noise and progressively denoising until a realistic sample is generated. Diffusion models have recently gained significant attention due to their ability to produce high-quality images and their ease of training.
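
To make the adversarial idea concrete, here is a heavily simplified single GAN training step in PyTorch; the architectures, data, and hyperparameters are placeholders, not a recipe from the book:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2  # illustrative sizes

# Generator maps random noise to fake samples; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(128, data_dim) * 0.5 + 2.0  # stand-in for real training data

# Discriminator step: label real data 1, generated data 0.
z = torch.randn(128, latent_dim)
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
opt_D.zero_grad()
d_loss.backward()
opt_D.step()

# Generator step: try to make the discriminator call fakes "real".
z = torch.randn(128, latent_dim)
g_loss = bce(D(G(z)), torch.ones(128, 1))
opt_G.zero_grad()
g_loss.backward()
opt_G.step()
```

Real GAN training repeats these two steps many times and is notoriously sensitive to the balance between the two networks.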

Why Deep Learning Should Not Work (But Does)

Traditional machine learning models often struggle with high-dimensional data, like images or text, due to the "curse of dimensionality." This phenomenon describes how the volume of data required to adequately represent a function grows exponentially with the number of input dimensions. Deep learning models, however, seem to defy this curse. They excel at processing high-dimensional data, even when the number of training examples is dwarfed by the number of possible inputs.
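
A quick back-of-the-envelope calculation shows how harsh the curse is: covering each input dimension with even a coarse grid of ten sample values requires a number of grid points that grows exponentially with dimension.

```python
# Number of grid points needed to cover the input space with 10 values per dimension.
for dims in (1, 2, 10, 100, 784):   # 784 = pixels in a 28x28 grayscale image
    print(dims, 10 ** dims)
```

A small 28×28 grayscale image already has 784 input dimensions, so no conceivable dataset can tile its input space; whatever deep networks are doing, it is not brute-force coverage.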

Furthermore, deep networks are notorious for being highly complex, capable of describing functions with an astronomical number of linear regions. This complexity raises concerns about overfitting, where the model simply memorizes the training data and fails to generalize to new, unseen examples. Yet, deep learning models often generalize remarkably well, even when they are significantly overparameterized (having more parameters than training data points).

So, why does deep learning work so well, seemingly defying theoretical expectations? While a definitive answer remains elusive, researchers have identified several contributing factors:

Factors Influencing Training Success

  • Overparameterization: The abundance of parameters in deep networks is thought to create a vast landscape of solutions that fit the training data well. This "haystack of needles" (Sejnowski, 2020) makes it easier for optimization algorithms to find a good solution, potentially avoiding local minima and saddle points that plague traditional models.
  • Activation Functions: The choice of activation function, which introduces non-linearity into the network, plays a crucial role. ReLU (Rectified Linear Unit) and its variants have proven particularly effective, likely due to their simple form and well-behaved gradients (see the short snippet after this list).
  • Implicit Regularization: Stochastic gradient descent (SGD), the workhorse optimization algorithm for deep learning, exhibits an implicit bias towards solutions that generalize well. This bias, which can be understood as adding a regularization term to the loss function, encourages the model to find smoother and more stable solutions.
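
For reference, ReLU itself is nothing more than a clamp at zero, which is why its gradient is so well behaved; a minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: passes positive values through, zeroes out the rest."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: exactly 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # 0, 0, 0, 0.5, 2
print(relu_grad(x))  # 0, 0, 0, 1, 1
```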

Understanding the Loss Landscape

Visualizing the loss function of a deep network is challenging due to its high dimensionality. However, research has shed light on its properties:

  • Multiple Global Minima: Due to symmetries and redundancies in network architecture, there are often many equivalent solutions that achieve the same (near-zero) training loss. This degeneracy further contributes to the ease of training.
  • Connectivity of Minima: While different minima might not be directly connected by a straight line of low loss, there is evidence that they are often connected by a "manifold" of low loss, suggesting a smoother and more navigable loss landscape (one way to probe this is sketched after this list).
  • Curvature and Flatness: Studies have shown that the loss surface around minima tends to be flatter in deep networks, particularly when the weights are initialized appropriately. This flatness implies that small errors in the estimated parameters are less detrimental to performance, potentially contributing to better generalization.
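
One common way researchers probe these properties is to evaluate the loss along a straight line between two independently trained solutions. The sketch below assumes two already-trained PyTorch models with identical architectures (model_a and model_b) and a user-supplied eval_loss function; all three names are placeholders.

```python
import copy
import torch

def interpolate_loss(model_a, model_b, eval_loss, steps=11):
    """Evaluate the loss at evenly spaced points on the straight line between
    two trained parameter vectors. A low-loss path (or the lack of one) hints
    at how the corresponding minima are connected."""
    losses = []
    probe = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    for i in range(steps):
        alpha = i / (steps - 1)
        mixed = {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}
        probe.load_state_dict(mixed)
        with torch.no_grad():
            losses.append(eval_loss(probe))
    return losses
```

If the loss stays low along the path (or along a slightly curved path found by more elaborate methods), the two minima are effectively connected, which is the kind of evidence behind the "manifold of low loss" picture.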

Factors Contributing to Generalization

  • Implicit Regularization of SGD: As mentioned earlier, SGD seems to implicitly regularize the model, favoring solutions that generalize well. This effect is amplified by smaller batch sizes and larger learning rates, potentially explaining the success of these training strategies.
  • Flatness of Minima: The flatness of the loss surface around minima is also thought to contribute to better generalization. By avoiding sharp minima, the model becomes less sensitive to small changes in the parameters, making it more robust to variations in the test data.
  • Architecture and Inductive Bias: Choosing an architecture that matches the data structure is crucial. Convolutional networks, for example, excel at processing images due to their built-in assumptions about spatial relationships between pixels. This inductive bias helps the model generalize better than a generic architecture (compare the parameter counts in the snippet after this list).
  • Oversmoothing in Graph Neural Networks: Interestingly, increasing depth in graph neural networks can lead to "oversmoothing," where local information is washed out, hindering performance. This suggests that the optimal depth might vary depending on the data type and task.
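
The inductive-bias point can be made concrete with a parameter count: a convolutional layer reuses the same small filter at every image position, while a fully connected layer learns a separate weight for every input-output pair. The layer sizes below are illustrative.

```python
import torch.nn as nn

# One convolutional layer: 16 filters of size 3x3 over a 3-channel image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# A fully connected layer mapping the same 3x32x32 image to a same-sized feature map.
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

print(sum(p.numel() for p in conv.parameters()))   # 448 parameters
print(sum(p.numel() for p in dense.parameters()))  # roughly 50 million parameters
```

The convolution gets away with about five orders of magnitude fewer parameters precisely because it bakes in the assumption that the same local pattern matters anywhere in the image.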

Conclusion

The field of deep learning is driven by a fascinating interplay between empirical observations and theoretical investigations. While the success of deep learning is undeniable, understanding the reasons behind this success remains an ongoing quest. This quest is crucial not only for satisfying scientific curiosity but also for developing more robust, efficient, and ethical AI systems that can truly benefit society.

This blog post has offered a glimpse into the world of deep learning, highlighting its core principles and exploring the fascinating journey from supervised to unsupervised learning. While many questions remain unanswered, the ongoing quest to understand deep learning is crucial for advancing the field and ensuring its responsible development and deployment. As deep learning continues to shape our future, engaging with its concepts and participating in discussions about its ethical and societal implications is essential.
