Title: 2308.08468v1.pdf

URL Source: https://arxiv.org/pdf/2308.08468

Published Time: Sat, 19 Aug 2023 00:01:03 GMT

Number of Pages: 36

Markdown Content:
# AN EXPERT’S GUIDE TO TRAINING PHYSICS-INFORMED NEURAL NETWORKS

Sifan Wang
Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104
sifanw@sas.upenn.edu

Shyam Sankaran
Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104
shyamss@seas.upenn.edu

Hanwen Wang
Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104
wangh19@sas.upenn.edu

Paris Perdikaris
Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104
pgp@seas.upenn.edu
# ABSTRACT
Physics-informed neural networks (PINNs) have been popularized as a deep learning framework that can seamlessly synthesize observational data and partial differential equation (PDE) constraints. Their practical effectiveness however can be hampered by training pathologies, but also oftentimes by poor choices made by users who lack deep learning expertise. In this paper we present a series of best practices that can significantly improve the training efficiency and overall accuracy of PINNs. We also put forth a series of challenging benchmark problems that highlight some of the most prominent difficulties in training PINNs, and present comprehensive and fully reproducible ablation studies that demonstrate how different architecture choices and training strategies affect the test accuracy of the resulting models. We show that the methods and guiding principles put forth in this study lead to state-of-the-art results and provide strong baselines that future studies should use for comparison purposes. To this end, we also release a highly optimized library in JAX that can be used to reproduce all results reported in this paper, enable future research studies, as well as facilitate easy adaptation to new use-case scenarios.
# 1 Introduction
Recent advances in deep learning have revolutionized fields such as computer vision, natural language processing and reinforcement learning [1, 2, 3]. Powered by rapid growth in computational resources, deep neural networks are also increasingly used for modeling and simulating physical systems. Examples of these include weather forecasting [4, 5, 6], quantum chemistry [7, 8] and protein structure prediction [9]. Notably, the fusion of scientific computing and machine learning has led to the emergence of physics-informed neural networks (PINNs) [10], an emerging paradigm for tackling forward and inverse problems involving partial differential equations (PDEs). These deep learning models are known for their capability to seamlessly incorporate noisy experimental data and physical laws into the learning process. This is typically accomplished by parameterizing unknown functions of interest using deep neural networks and formulating a multi-task learning problem with the aim of matching observational data and approximating an underlying PDE system. Over the last couple of years, PINNs have led to a series of promising results across a range of problems in computational science and engineering, including fluid mechanics [11, 12, 13], bio-engineering [14, 15], materials [16, 17, 18], molecular dynamics [19], electromagnetics [20, 21], geosciences [22, 23], and the design of thermal systems [24, 25].

> arXiv:2308.08468v1 [cs.LG] 16 Aug 2023

Despite some empirical success, PINNs are still facing many challenges that define open areas for research and further methodological advancements. In recent years, there have been numerous studies focusing on improving the performance of PINNs, mostly by designing more effective neural network architectures or better training algorithms. For example, loss re-weighting schemes have emerged as a prominent strategy for promoting a more balanced training process and improved test accuracy [26, 27, 28, 29]. Other efforts aim to achieve similar goals by adaptively re-sampling collocation points, such as importance sampling [30], evolutionary sampling [31] and residual-based adaptive sampling [32]. Considerable efforts have also been dedicated towards developing new neural network architectures to improve the representation capacity of PINNs. Examples include the use of adaptive activation functions [33], positional embeddings [34, 35], and novel architectures [26, 36, 37, 38, 39, 40]. Another research avenue explores alternative objective functions for PINNs training, beyond the weighted summation of residuals [41]. Some approaches incorporate numerical differentiation [42], while others draw inspiration from Finite Element Methods (FEM), adopting variational formulations [43, 44]. Other approaches propose adding additional regularization terms to accelerate training of PINNs [45, 46]. Lastly, the evolution of training strategies has been an area of active research. Techniques such as sequential training [47, 48] and transfer learning [49, 50, 51] have shown potential in speeding up the learning process and yielding better predictive accuracy.

While new research on PINNs is currently being produced at high frequency, a suite of common benchmarks and baselines is still missing from the literature. Almost all existing studies put forth their own collection of benchmark examples, and typically compare against the original PINNs formulation put forth by Raissi et al., which is admittedly a weak baseline. This introduces many difficulties in systematically assessing progress in the field, but also in determining how to use PINNs from a practitioner’s standpoint.

To address this gap, this work proposes a training pipeline that seamlessly integrates recent research developments to effectively resolve the identified issues in PINNs training, including spectral bias [52, 35], unbalanced back-propagated gradients [26, 27] and causality violation [53]. In addition, we present a variety of techniques that could further enhance performance, shedding light on some practical tips that form a guideline for selecting hyper-parameters. This is accompanied by an extensive suite of fully reproducible ablation studies performed across a wide range of benchmarks. This allows us to identify the setups that consistently yield state-of-the-art results, which we believe should become the new baseline that future studies should compare against. We also release a high-performance library in JAX that can be used to reproduce all findings reported in this work, enable future research studies, as well as facilitate easy adaptation to new use-case scenarios. As such, we believe that this work can equally benefit researchers and practitioners to further advance PINNs and deploy them in more realistic application settings.

The rest of this paper is organized as follows. In Section 2, we provide a brief overview of the original formulation of PINNs as introduced by Raissi et al. [10], and outline our training pipeline. From Section 3 to Section 5, we delve into the motivation and implementation details of the key components of the proposed algorithm. These consist of non-dimensionalization, network architectures that employ Fourier feature embeddings and random weight factorization, as well as training algorithms such as causal training, curriculum training and loss weighting strategies. Section 6 discusses various aspects of PINNs that lead to improved stability and superior training performance. Finally, in Section 7 we validate the effectiveness and robustness of the proposed pipeline across a wide range of benchmarks and showcase state-of-the-art results.
# 2 Physics-informed Neural Networks
Following the original formulation of Raissi et al., we begin with a brief overview of physics-informed neural networks (PINNs) [10] in the context of solving partial differential equations (PDEs). Generally, we consider PDEs taking the form

$$u_t + \mathcal{N}[u] = 0, \quad t \in [0, T], \ x \in \Omega, \tag{2.1}$$

subject to the initial and boundary conditions

$$u(0, x) = g(x), \quad x \in \Omega, \tag{2.2}$$

$$\mathcal{B}[u] = 0, \quad t \in [0, T], \ x \in \partial\Omega, \tag{2.3}$$

where $\mathcal{N}[\cdot]$ is a linear or nonlinear differential operator, and $\mathcal{B}[\cdot]$ is a boundary operator corresponding to Dirichlet, Neumann, Robin, or periodic boundary conditions. In addition, $u$ describes the unknown latent solution that is governed by the PDE system of Equation (2.1).

We proceed by representing the unknown solution $u(t, x)$ by a deep neural network $u_\theta(t, x)$, where $\theta$ denotes all tunable parameters of the network (e.g., weights and biases). This allows us to define the PDE residuals as

$$\mathcal{R}_\theta(t, x) = \frac{\partial u_\theta}{\partial t}(t, x) + \mathcal{N}[u_\theta](t, x). \tag{2.4}$$

Figure 1: Illustration of the proposed training pipeline. The procedure begins with the non-dimensionalization of the PDE system, ensuring that input and output variables are in a reasonable range. Subsequently, an appropriate network architecture is constructed to represent the unknown PDE solution. The use of Fourier feature embeddings and random weight factorization is highly recommended for mitigating spectral bias and accelerating convergence. The training phase of the PINN model integrates various advanced algorithms, including self-adaptive loss balancing, causal training, and curriculum training.

Then, a physics-informed model can be trained by minimizing the following composite loss function

$$\mathcal{L}(\theta) = \mathcal{L}_{ic}(\theta) + \mathcal{L}_{bc}(\theta) + \mathcal{L}_{r}(\theta), \tag{2.5}$$

where

$$\mathcal{L}_{ic}(\theta) = \frac{1}{N_{ic}} \sum_{i=1}^{N_{ic}} \left| u_\theta(0, x_{ic}^i) - g(x_{ic}^i) \right|^2, \tag{2.6}$$

$$\mathcal{L}_{bc}(\theta) = \frac{1}{N_{bc}} \sum_{i=1}^{N_{bc}} \left| \mathcal{B}[u_\theta](t_{bc}^i, x_{bc}^i) \right|^2, \tag{2.7}$$

$$\mathcal{L}_{r}(\theta) = \frac{1}{N_{r}} \sum_{i=1}^{N_{r}} \left| \mathcal{R}_\theta(t_{r}^i, x_{r}^i) \right|^2. \tag{2.8}$$

Here $\{x_{ic}^i\}_{i=1}^{N_{ic}}$, $\{t_{bc}^i, x_{bc}^i\}_{i=1}^{N_{bc}}$ and $\{t_{r}^i, x_{r}^i\}_{i=1}^{N_{r}}$ can be the vertices of a fixed mesh or points that are randomly sampled at each iteration of a gradient descent algorithm. Notice that all required gradients with respect to input variables or network parameters $\theta$ can be efficiently computed via automatic differentiation [54].

However, as demonstrated by recent work, several critical training pathologies prevent PINNs from yielding accurate and robust results. These pathologies include spectral bias [52, 35], causality violation [53], and unbalanced back-propagated gradients among different loss terms [26]. To address these issues, we propose a training pipeline that integrates key recent advancements, which we believe are indispensable for the successful implementation of PINNs. As shown in Figure 1, the pipeline consists of three main steps: PDE non-dimensionalization, choosing suitable network architectures, and employing appropriate training algorithms. Further details are provided in Algorithm 1. In the following sections, we will carefully demonstrate the motivation and necessity of each component in the proposed algorithm and validate its effectiveness via a wide range of benchmarks.
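As a rough illustration of how the three loss terms (2.6) to (2.8) are assembled, the following sketch evaluates them for a 1D heat equation $u_t - \kappa u_{xx} = 0$ on $x \in [0, 1]$ with zero Dirichlet boundaries. Everything specific here is an assumption made for the example and does not come from the paper: the choice of PDE, the constant `KAPPA`, the function names, and above all the use of central finite differences and closed-form candidate functions in place of a neural network $u_\theta$ with automatic differentiation.

```python
import math
import random

KAPPA = 0.1  # hypothetical diffusivity of the illustrative heat equation


def residual(u, t, x, h=1e-3):
    """PDE residual R(t, x) = u_t - KAPPA * u_xx, via central finite differences.

    A real PINN would obtain these derivatives by automatic differentiation.
    """
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / (h * h)
    return u_t - KAPPA * u_xx


def composite_loss(u, g, n_ic=64, n_bc=64, n_r=256, t_final=1.0, seed=0):
    """Return (L_ic, L_bc, L_r) for a candidate solution u(t, x)."""
    rng = random.Random(seed)

    # Eq. (2.6): mismatch with the initial condition g at t = 0.
    xs = [rng.random() for _ in range(n_ic)]
    l_ic = sum((u(0.0, x) - g(x)) ** 2 for x in xs) / n_ic

    # Eq. (2.7): Dirichlet condition u = 0 on the boundary x in {0, 1}.
    bc_pts = [(rng.uniform(0.0, t_final), rng.choice((0.0, 1.0))) for _ in range(n_bc)]
    l_bc = sum(u(t, x) ** 2 for t, x in bc_pts) / n_bc

    # Eq. (2.8): mean squared PDE residual at random collocation points.
    r_pts = [(rng.uniform(0.0, t_final), rng.random()) for _ in range(n_r)]
    l_r = sum(residual(u, t, x) ** 2 for t, x in r_pts) / n_r

    return l_ic, l_bc, l_r


def g(x):
    """Initial condition u(0, x)."""
    return math.sin(math.pi * x)


def exact(t, x):
    """Closed-form solution of the illustrative problem."""
    return math.exp(-KAPPA * math.pi ** 2 * t) * math.sin(math.pi * x)


def wrong(t, x):
    """Satisfies the initial and boundary conditions but ignores the decay in time."""
    return math.sin(math.pi * x)


if __name__ == "__main__":
    print("exact:", composite_loss(exact, g))
    print("wrong:", composite_loss(wrong, g))
```

The second candidate matches the initial and boundary data, so only the residual term distinguishes it from the true solution. This is the role $\mathcal{L}_r$ plays in the composite loss (2.5).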
# 3 Non-dimensionalization
It is well-known that data normalization is an important pre-processing step in traditional deep learning, which typically involves scaling the input features of a data-set so that they have similar magnitudes and ranges [55, 56]. However, this process may not be generally applicable for PINNs as the target solutions are typically not available when solving forward PDE problems. In such cases, it is important to ensure that the target output variables vary within a reasonable range. One way to achieve this is through non-dimensionalization, a common technique used in mathematics and physics to simplify and analyze complex systems by transforming the original system into an equivalent dimensionless system. This is performed by selecting one or more fundamental units or characteristic values, and scaling the variables in the problem so that they become dimensionless and of order one.

From our experience, non-dimensionalization plays a crucial role in building physics-informed models, especially when dealing with experimental data or real-world problems. The reasons are as follows:

- Lack of consistent network initialization schemes: The initialization of neural networks plays a crucial role in the effectiveness of gradient descent algorithms. Common initialization schemes (e.g., Glorot [55]) not only prevent vanishing gradients but also accelerate training convergence. A critical assumption for these initialization methods is that input variables should be in a moderate range, such as having a zero mean and unit variance, which enables smooth and stable forward and backward propagation. To satisfy this assumption, we propose using non-dimensionalization to scale the input and output variables so that they are of order one.
- Mitigating the disparities in variable scales: If input and output variables have different scales, some can dominate over others, leading to unbalanced contributions in the model training, therefore hindering the learning of meaningful correlations between them. Non-dimensionalization, which scales variables to have similar magnitudes and ranges, can help to reduce this discrepancy and facilitate model training.
- Improving convergence: If the variables are not properly scaled, the optimization algorithm may have to take very small steps to adjust the weights for one variable while taking large steps for another variable. This may result in a slow and unstable training process. Through non-dimensionalization, the optimizer can take more consistent steps, yielding faster convergence and better performance.

While non-dimensionalization is an indispensable pre-processing step, it is not a “silver bullet” that can resolve all issues in training PINNs. One of the main differences between PINNs and conventional deep learning tasks is the minimization of PDE residuals, which introduces additional difficulties in the optimization process. Even if all variables are properly scaled via non-dimensionalization, the scale of PDE residuals can still vastly differ from the scale of the latent solution function, leading to a considerable discrepancy in the scale of different loss terms. Therefore, it is important to carefully inspect and re-scale the loss terms that define the PINNs objective. In Section 5.2, we introduce two self-adaptive loss weighting schemes based on the magnitude of back-propagated gradients and Neural Tangent Kernel (NTK) theory. We will show that these methods can automatically balance the interplay between different loss terms during training and lead to more robust model performance.

Algorithm 1: Training pipeline of physics-informed neural networks

1. Non-dimensionalize the PDE system (2.1).
2. Represent the PDE solution by a multi-layer perceptron network (MLP) $u_\theta$ with Fourier feature embeddings and random weight factorization. In general, we recommend using tanh activation and initializing the network with the Glorot scheme.
3. Formulate the weighted loss function according to the PDE system:

   $$\mathcal{L}(\theta) = \lambda_{ic} \mathcal{L}_{ic}(\theta) + \lambda_{bc} \mathcal{L}_{bc}(\theta) + \lambda_{r} \mathcal{L}_{r}(\theta), \tag{2.9}$$

   where $\mathcal{L}_{ic}(\theta)$ and $\mathcal{L}_{bc}(\theta)$ are defined in (2.6) and (2.7) respectively, and

   $$\mathcal{L}_{r}(\theta) = \frac{1}{M} \sum_{i=1}^{M} w_i \mathcal{L}_{r}^{i}(\theta). \tag{2.10}$$

   Here we partition the temporal domain into $M$ equal sequential segments and introduce $\mathcal{L}_{r}^{i}$ to denote the PDE residual loss within the $i$-th segment of the temporal domain.
4. Set all global weights $\lambda_{ic}, \lambda_{bc}, \lambda_{r}$ and temporal weights $\{w_i\}_{i=1}^{M}$ to 1.
5. Use $S$ steps of a gradient descent algorithm to update the parameters $\theta$ as:

   for $n = 1, \dots, S$ do

   (a) Randomly sample $\{x_{ic}^i\}_{i=1}^{N_{ic}}$, $\{t_{bc}^i, x_{bc}^i\}_{i=1}^{N_{bc}}$ and $\{t_{r}^i, x_{r}^i\}_{i=1}^{N_{r}}$ in the computational domain and evaluate each loss term $\mathcal{L}_{ic}$, $\mathcal{L}_{bc}$ and $\{\mathcal{L}_{r}^{i}\}_{i=1}^{M}$.

   (b) Compute and update the temporal weights by

   $$w_i = \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}_{r}^{k}(\theta) \right), \quad \text{for } i = 2, 3, \dots, M. \tag{2.11}$$

   Here $\epsilon > 0$ is a user-defined hyper-parameter that determines the "slope" of temporal weights.

   if $n \bmod f = 0$ then

   (c) Compute the global weights by

   $$\hat{\lambda}_{ic} = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_{r}(\theta)\|}{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\|}, \tag{2.12}$$

   $$\hat{\lambda}_{bc} = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_{r}(\theta)\|}{\|\nabla_\theta \mathcal{L}_{bc}(\theta)\|}, \tag{2.13}$$

   $$\hat{\lambda}_{r} = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_{r}(\theta)\|}{\|\nabla_\theta \mathcal{L}_{r}(\theta)\|}, \tag{2.14}$$

   where $\|\cdot\|$ denotes the $L^2$ norm.

   (d) Update the global weights $\lambda = (\lambda_{ic}, \lambda_{bc}, \lambda_{r})$ using a moving average of the form

   $$\lambda_{\text{new}} = \alpha \lambda_{\text{old}} + (1 - \alpha) \hat{\lambda}_{\text{new}}, \tag{2.15}$$

   where the parameter $\alpha$ determines the balance between the old and new values.

   end if

   (e) Update the parameters $\theta$ via gradient descent

   $$\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n). \tag{2.16}$$

   end for

The recommended default values for hyper-parameters are as follows: $f = 1{,}000$, $\alpha = 0.9$, $\gamma = 1.0$, $\epsilon = 1.0$. Please note that we freeze the back-propagation of the weights $w_i$ and $\lambda_i$ with respect to network parameters $\theta$.
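The weight updates in steps (b) to (d) of Algorithm 1 involve only scalar arithmetic once the per-segment residual losses and the gradient norms are known, so they can be sketched without any deep learning framework. The snippet below is a minimal sketch under that assumption: the function names and the example numbers are hypothetical, and the gradient norms are passed in as plain floats, whereas in practice they would come from automatic differentiation of each loss term. It also assumes every gradient norm is non-zero.

```python
import math


def temporal_weights(segment_losses, eps=1.0):
    """Eq. (2.11): w_i = exp(-eps * sum of residual losses of all earlier segments).

    The first segment has an empty sum, so its weight is 1.
    """
    weights = []
    running = 0.0  # cumulative residual loss of segments 1 .. i-1
    for loss in segment_losses:
        weights.append(math.exp(-eps * running))
        running += loss
    return weights


def global_weight_targets(grad_norms):
    """Eqs. (2.12)-(2.14): (sum of all gradient norms) / (gradient norm of this term).

    grad_norms maps a loss-term name to the L2 norm of its parameter gradient.
    Assumes every norm is strictly positive.
    """
    total = sum(grad_norms.values())
    return {name: total / norm for name, norm in grad_norms.items()}


def moving_average(old, target, alpha=0.9):
    """Eq. (2.15): blend the previous weights with the freshly computed targets."""
    return {name: alpha * old[name] + (1.0 - alpha) * target[name] for name in old}


if __name__ == "__main__":
    # Hypothetical per-segment residual losses, ordered in time.
    print(temporal_weights([0.5, 0.1, 0.0]))

    # Hypothetical gradient norms for the three loss terms.
    lam = {"ic": 1.0, "bc": 1.0, "r": 1.0}  # step 4: start from all ones
    targets = global_weight_targets({"ic": 1.0, "bc": 2.0, "r": 5.0})
    print(moving_average(lam, targets))
```

Two properties are visible in this form. A later time segment only receives a weight close to 1 once the residual losses of all earlier segments are small, which is how the temporal weights encourage the solution to be learned in causal order. The global weight targets are inversely proportional to each term's gradient norm, so a term with a small gradient is up-weighted relative to the others.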
# 4 Network Architecture
In this section, we delve into the selection of suitable network architectures for training PINNs. We begin by providing a brief overview of multi-layer perceptrons, along with common hyper-parameter choices, activation functions, and initialization schemes. Then, we discuss random Fourier feature embeddings, a simple yet effective technique that enables coordinate MLPs to learn complex high frequency functions. Finally, we introduce random weight factorization, a simple drop-in replacement of dense layers that has been shown to consistently accelerate training convergence and improve model performance.
## 4.1 Multi-layer Perceptrons (MLP)
We mainly use multi-layer perceptrons (MLPs) as universal approximators to represent the latent functions of interest, which take the coordinates of a spatio-temporal domain as inputs and predict the corresponding target solution functions. Specifically, let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. An MLP $f_\theta(x)$ is recursively defined by

$$f^{(l)}(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma(f^{(l)}(x)), \quad l = 1, 2, \dots, L, \tag{4.1}$$

with a final layer

$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}, \tag{4.2}$$

where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix in the $l$-th layer and $\sigma$ is an element-wise activation function. Here,
θ = |