From a90624b36d351c3a331ac3415732c3ecd7376e14 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Fri, 6 Mar 2026 11:48:41 +0800
Subject: [PATCH] feat(pinn): add pinn/ sub-skill with SKILL.md and evidence

SKILL.md: 478-line PINN training best practices (complexity ladder,
nondim, architecture, optimization, loss design, sampling, property
mappings, ConFIG, domain decomposition).

docs/evidence/: 6 files -- krishnapriyan2021, sukumar2022, wang2022
causal, wang2022+2023 expert guides, Brunton youtube transcripts.

Missing evidence (to fetch): Wang 2001.04536 (gradient pathologies),
Rathore 2402.01868 (ICML loss landscape).

Author: wassname (https://github.com/wassname)
---
 pinn/SKILL.md                                 |  484 ++++
 .../docs/evidence/PINN youtube transcripts.md | 1603 +++++++++++
 .../krishnapriyan2021_failure_modes.md        | 1296 +++++++++
 .../evidence/sukumar2022_exact_bc_distance.md | 2541 +++++++++++++++++
 .../docs/evidence/wang2022_causal_training.md | 1897 ++++++++++++
 .../evidence/wang2022_expert_guide_pinn.md    | 1983 +++++++++++++
 .../evidence/wang2023_expert_guide_pinn.md    | 1490 ++++++++++
 7 files changed, 11294 insertions(+)
 create mode 100644 pinn/SKILL.md
 create mode 100644 pinn/docs/evidence/PINN youtube transcripts.md
 create mode 100644 pinn/docs/evidence/krishnapriyan2021_failure_modes.md
 create mode 100644 pinn/docs/evidence/sukumar2022_exact_bc_distance.md
 create mode 100644 pinn/docs/evidence/wang2022_causal_training.md
 create mode 100644 pinn/docs/evidence/wang2022_expert_guide_pinn.md
 create mode 100644 pinn/docs/evidence/wang2023_expert_guide_pinn.md

diff --git a/pinn/SKILL.md b/pinn/SKILL.md
new file mode 100644
index 0000000..6167ff6
--- /dev/null
+++ b/pinn/SKILL.md
@@ -0,0 +1,484 @@
+---
+name: pinn-debug
+description: "PINN (Physics-Informed Neural Network) training best practices and debugging. Use when building, debugging, or optimizing PINNs for PDEs, ODEs, or physics-constrained learning problems. Sub-skill of ml-debugging."
+---
+
+
+# PINN Training Best Practices
+
+https://gist.github.com/wassname/a68314f731002b32aa637c3d43a081c9
+
+
+Consolidated from: NeuralPDE.jl tests/docs, ConFIG repo, Wang et al. 2021, Rathore et al. 2024 (ICML), ml_debug folklore, and practical experience with heat exchanger PINNs.
+
+Epistemic status: Patterns confirmed across multiple sources. Where sources disagree, noted. Paper claims marked with credence estimates.
+
+---
+
+## 0. Before You PINN: Complexity Ladder and Baselines
+
+PINNs are complex. Before trusting a PINN, work up the complexity ladder and compare against simpler baselines at each level. Each level should beat the previous one or you're adding complexity for nothing.
+
+**Make complexity pay rent.** If a fancier model doesn't improve on the simpler one, the added physics/architecture is either wrong, badly scaled, or unnecessary.
+
+### Complexity ladder (heat exchanger example)
+
+| Level | Model | Assumptions | What it tests |
+|---|---|---|---|
+| 0 | Persistence: y(t) = y(t-1) | None | Does model capture any dynamics? |
+| 1 | Exponential decay: T(t) = T_ss + (T_0 - T_ss) exp(-t/tau) | First-order response | Dominant thermal mode |
+| 2 | Linear state-space / DMDc: fit A, B via OLS on finite differences (or Dynamic Mode Decomposition with control) | Linear dynamics | Near-linear regime |
+| 3 | Pure data MLP (no physics loss) | NN can interpolate | If PINN doesn't beat this, physics constraint hurts |
+| 4 | Effectiveness-NTU closed form | 1D, constant Cp, steady state | Analytical sanity check |
+| 5 | 1D ODE with constant Cp (scipy solve_bvp) | 1D, constant Cp | Classical numerical baseline |
+| 6 | 1D ODE with enthalpy (scipy solve_bvp) | 1D, variable properties | Phase change handling |
+| 7 | 1D PINN with enthalpy | 1D, learned U | Differentiable model, parameter estimation |
+| 8 | 2D PINN with radial maldistribution | 2D, learned U + u_r | Full physics |
+
+At each level, brainstorm:
+- What assumption am I adding/relaxing?
+- What does this buy me (lower RMSE, new physics captured)?
+- What breaks if I simplify further?
+- **Which inputs does the model need to be conditional on?** If you want what-if scenarios ("what happens if I double the flow rate?"), the model must take that variable as an explicit input -- not bake it in as a constant. A model trained at a single fixed m_dot has never seen m_dot vary, so it cannot answer what happens at 2 * m_dot. Decide the conditioning variables before building the model.
+
+**Try the simplest baselines first.** If scipy.optimize.minimize + solve_bvp fits your data to 2 K RMSE, you probably don't need a PINN.
+
+**Steady-state before dynamic.** If the problem has a transient component, solve the steady-state first. It's strictly simpler (no IC, no temporal causality), gives a reference solution, and the PINN for transient can be initialized from the steady-state solution. Many process systems spend >90% of operating time near steady-state anyway.
+
+Report RMSE, MAE, and trajectory plots for each baseline alongside the PINN. Physics-informed models should win on extrapolation and long-horizon rollout even if in-sample fit is similar.
+
+**Train/test split**: Use temporal split (not random). First 75% train, last 25% test. Random splitting leaks temporal correlation.
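+
+A minimal sketch of the first two rungs of the ladder with a temporal split, assuming a synthetic first-order response and a known steady-state temperature. The numbers and the helper names `temporal_split` and `rmse` are illustrative, not from any plant data set.
+
+```python
+import numpy as np
+
+def temporal_split(y, train_frac=0.75):
+    """Split a time series in order: first part train, last part test."""
+    n_train = int(len(y) * train_frac)
+    return y[:n_train], y[n_train:]
+
+def rmse(pred, true):
+    return float(np.sqrt(np.mean((pred - true) ** 2)))
+
+# Synthetic first-order response (illustrative values only)
+t = np.linspace(0.0, 10.0, 200)
+T_ss, T_0, tau = 300.0, 350.0, 2.0
+y = T_ss + (T_0 - T_ss) * np.exp(-t / tau)
+
+y_train, y_test = temporal_split(y)
+n_train = len(y_train)
+
+# Level 0: persistence, y(t) = y(t-1), scored one step ahead on the test span
+persistence_pred = np.concatenate([y_train[-1:], y_test[:-1]])
+rmse_persistence = rmse(persistence_pred, y_test)
+
+# Level 1: exponential decay, tau fitted by log-linear least squares on train only
+slope, _ = np.polyfit(t[:n_train], np.log(y_train - T_ss), 1)
+tau_fit = -1.0 / slope
+decay_pred = T_ss + (T_0 - T_ss) * np.exp(-t[n_train:] / tau_fit)
+rmse_decay = rmse(decay_pred, y_test)
+
+print(f"persistence RMSE={rmse_persistence:.4g}, decay RMSE={rmse_decay:.4g}")
+```
+
+On real data the same two numbers give the floor that every later level has to beat.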
+
+---
+
+## 1. Nondimensionalization
+
+Nondimensionalize the governing equations *before* feeding them to the network. Raw SI units (T in Kelvin, h in J/kg, z in meters) create loss terms with wildly different magnitudes -- this is exactly the multi-scale problem that adaptive weighting tries to fix downstream. Nondimensionalizing fixes it at the source.
+
+**Recipe:**
+1. Pick characteristic scales: T_ref, L_ref, h_ref = Cp_ref * T_ref, etc.
+2. Define dimensionless variables: z* = z/L, T* = (T - T_cold)/(T_hot - T_cold), etc.
+3. Substitute into the PDE, cancel dimensions.
+4. The resulting coefficients are O(1) dimensionless groups (NTU, Biot, etc.).
+
+This also makes the input domain ~[0,1] naturally, which is what NeuralPDE.jl examples assume but never say why.
+
+If you can't nondimensionalize cleanly (unknown material properties, mixed units), at minimum z-score each input/output channel so the network sees zero-mean unit-variance data.
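+
+As a worked example of the recipe, assume a single-stream, constant-Cp energy balance `m_dot * cp * dT/dz = -U * P * (T - T_wall)`, where `P` is a heat-transfer perimeter (an assumption for illustration). Substituting z* = z/L and T* = (T - T_cold)/(T_hot - T_cold) gives `dT*/dz* = -NTU * (T* - T_wall*)` with `NTU = U * P * L / (m_dot * cp)`. The sketch below uses made-up values to show the scaled variables and the resulting O(1) coefficient.
+
+```python
+import numpy as np
+
+def nondimensionalize(z, T, L, T_cold, T_hot):
+    """Map z to [0, 1] and T to [0, 1] using the characteristic scales."""
+    z_star = z / L
+    T_star = (T - T_cold) / (T_hot - T_cold)
+    return z_star, T_star
+
+def ntu(U, perimeter, L, m_dot, cp):
+    """Dimensionless group that replaces U*P/(m_dot*cp) after scaling."""
+    return U * perimeter * L / (m_dot * cp)
+
+# Illustrative values only
+U, perimeter, L = 500.0, 0.1, 10.0   # W/m^2/K, m, m
+m_dot, cp = 0.5, 2000.0              # kg/s, J/kg/K
+T_cold, T_hot = 100.0, 300.0         # K
+
+z = np.linspace(0.0, L, 5)
+T = np.linspace(T_hot, T_cold, 5)
+z_star, T_star = nondimensionalize(z, T, L, T_cold, T_hot)
+
+dimensional_coeff = U * perimeter / (m_dot * cp)  # units 1/m, multiplies a ~200 K difference
+NTU = ntu(U, perimeter, L, m_dot, cp)             # dimensionless, multiplies an O(1) difference
+print(f"dimensional coefficient = {dimensional_coeff} 1/m, NTU = {NTU}")
+```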
+
+> Rathore et al. 2024: "The condition number grows polynomially with nres" -- but this is in raw units. Nondimensionalization reduces the effective condition number by making all PDE coefficients O(1).
+> Source: https://arxiv.org/abs/2402.01868, Section 5, Theorem 8.4
+
+---
+
+## 2. Architecture
+
+From NeuralPDE.jl tests/docs + Wang et al. 2021:
+
+- **Depth**: 2-3 layers most common; 4 sometimes; 5 is rare. Going deeper rarely helps and tends to hurt conditioning.
+- **Width**: 16-64 (start at 32). Occasionally 128 for harder problems.
+- **Activations**: tanh dominates. SiLU/GeLU sometimes. sigmoid for some ODEs. Final layer linear.
+- **Multi-output**: One chain per dependent variable: `[Chain(Dense(in, n, act), Dense(n, n, act), Dense(n, 1)) for _ in 1:k]` or a single shared trunk with separate heads.
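+- **Avoid by default**: Fourier features (unless spectral bias is a problem), batch norm, dropout, skip connections, attention. These suit data-rich regimes; PINNs are data-poor and physics-rich. Treat the bundled architectures later in this section as the exception, not the starting point.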
+- **Precision**: float64 for numerical stability. Convert model: `model.double()` or `|> f64` in Julia.
+- **Init**: Glorot uniform (Xavier), zero biases. Standard.
+
+**Modified MLP** (Wang et al. 2021, credence ~70%):
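+> Wang et al. propose a modified MLP with multiplicative gating: each hidden layer computes `Z = σ(WH + b)` from the previous hidden state H and outputs `(1 - Z) * U + Z * V` (elementwise), where `U = σ(W_u x + b_u)` and `V = σ(W_v x + b_v)` are two encodings of the network input x shared by every layer. Authors claim this reduces Hessian stiffness. "Greatly enhances predictive accuracy" for nonlinear PDEs.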
+> Source: https://arxiv.org/abs/2001.04536, Section 2.6
+> Evidence: 49x improvement on Helmholtz, 64x on Klein-Gordon. But only tested by the proposing authors; no independent replication found.
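+
+A minimal PyTorch sketch of the gated update described above. The class name `ModifiedMLP`, the layer sizes, and the choice of tanh are assumptions; U and V are taken here as tanh-activated encodings of the input, which is our reading of the paper and not a copy of the authors' code.
+
+```python
+import torch
+import torch.nn as nn
+
+class ModifiedMLP(nn.Module):
+    """MLP whose hidden state gates between two fixed encodings of the input."""
+
+    def __init__(self, in_dim, hidden, out_dim, n_gated=3):
+        super().__init__()
+        self.enc_u = nn.Linear(in_dim, hidden)   # encoder producing U
+        self.enc_v = nn.Linear(in_dim, hidden)   # encoder producing V
+        self.first = nn.Linear(in_dim, hidden)
+        self.gates = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_gated)])
+        self.out = nn.Linear(hidden, out_dim)    # final layer is linear
+
+    def forward(self, x):
+        u = torch.tanh(self.enc_u(x))
+        v = torch.tanh(self.enc_v(x))
+        h = torch.tanh(self.first(x))
+        for gate in self.gates:
+            z = torch.tanh(gate(h))
+            h = (1.0 - z) * u + z * v            # elementwise mix of the two encodings
+        return self.out(h)
+
+torch.manual_seed(0)
+model = ModifiedMLP(in_dim=2, hidden=32, out_dim=1).double()
+x = torch.rand(8, 2, dtype=torch.float64)
+y = model(x)
+print(y.shape)
+```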
+
+**Random Weight Factorization (RWF)** (arXiv 2210.01274, credence ~60%):
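+> Factorize each neuron's weight vector as w = s * v, where s is a trainable per-neuron scale factor and v is a trainable vector. At initialization s is drawn as the exponential of a Gaussian sample and v is set to w/s from a standard (Glorot) init, so the effective weights start unchanged; unlike weight normalization, v is not constrained to unit norm. This changes the optimization geometry so the loss surface has better-conditioned local minima. "Predictions obtained by RWF are in excellent agreement with ground truth, while other weight parameterizations result in poor or non-physical approximations."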
+> Source: https://arxiv.org/abs/2210.01274
+> Used in the PirateNet architecture alongside causal training, sequence-to-sequence, and Fourier features. Simple to implement as a custom parameterization on Linear layers.
+> Credence: plausible mechanism, but proposing-author result; check jaxpi repo for independent adoption.
+
+**PirateNet** (jaxpi library, credence ~55%): Bundles RWF + causal time-marching + seq2seq + Fourier features into one architecture. Good reference implementation when you want all the tricks.
+> Source: https://github.com/PredictiveIntelligenceLab/jaxpi
+
+**Symmetry-enforcing architectures** (Julia Ling et al., credence ~75%):
+> Instead of data-augmenting with transformed copies, bake symmetries directly into the architecture so every model in the function space is automatically invariant/equivariant. For turbulence closure (Reynolds stress from velocity gradients), custom tensor layers enforce Galilean invariance by construction. "The Galilean invariant model is more accurate than the other models" and generalizes better across flow configurations.
+> Source: Ling et al. 2016, "Machine learning strategies for systems with invariance properties." J. Comput. Phys. https://doi.org/10.1016/j.jcp.2016.05.003
+> Lecture: Brunton, S. "AI/ML+Physics Part 3 - Designing an Architecture." https://www.youtube.com/watch?v=fiX8c-4K0-Q
+> Key distinction: invariance (output unchanged by transformation, e.g., energy is frame-invariant) vs equivariance (output transforms same way as input, e.g., stress tensor rotates with frame). Equivariant architectures are more general. If your PDE has known symmetries (translation, rotation, scaling), enforce them architecturally rather than hoping the optimizer discovers them.
+> **Caveat**: This works best for local closure terms (Reynolds stress, turbulence models) and unbounded/periodic domains where the global symmetry holds everywhere. If your domain has boundary conditions that break the symmetry (e.g., a wall breaks rotational invariance), enforcing the symmetry globally in the architecture will prevent the solution from satisfying the BCs -- the architecture will be fighting the problem. In bounded domains, use symmetry-enforcing architectures only for terms where the symmetry genuinely holds (e.g., the constitutive relation), not for the full solution field. Libraries like `e3nn` implement this but add significant computational overhead.
+
+---
+
+## 3. Optimization
+
+The loss landscape of PINNs is ill-conditioned. First-order methods (Adam) converge slowly near the solution. Quasi-Newton methods (L-BFGS) converge faster but can stall when the strong Wolfe line search fails to find an acceptable step. The solution: combine them.
+
+### Recommended workflow
+
+```
+1. Adam (lr grid search: {1e-1, 1e-2, 1e-3}) for 1k-11k iterations
+   - Escapes saddle points, explores broadly
+   - Tolerates noisy gradients
+2. L-BFGS (lr=1.0, memory=100) until convergence stalls
+   - Preconditions the Hessian, reduces condition number by ~1000x
+   - Often stalls: cannot find step size satisfying strong Wolfe conditions
+3. (Optional) NysNewton-CG (NNCG) for final polish
+   - 1.4-4.3x further improvement in L2RE
+   - But 5-300x slower per iteration than L-BFGS
+```
+
+> Rathore et al. 2024 (ICML, credence ~80%): "Adam+L-BFGS attains 14.2x smaller L2RE than Adam on convection and 6.07x smaller than L-BFGS on wave." Tested on 3 PDEs (convection, reaction, wave), 5 seeds, widths 50-400.
+> Source: https://arxiv.org/abs/2402.01868, Table 1
+> Code: https://github.com/pratikrathore8/opt_for_pinns
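+
+The first two stages of the workflow can be sketched in PyTorch as follows. The toy ODE (u'(x) = cos(x), u(0) = 0), the network size, and the iteration counts are assumptions chosen to keep the example small; they are not the settings used by Rathore et al.
+
+```python
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+torch.set_default_dtype(torch.float64)
+
+# Toy problem (illustrative): u'(x) = cos(x) on [0, pi] with u(0) = 0
+x = torch.linspace(0.0, torch.pi, 64).reshape(-1, 1).requires_grad_(True)
+x0 = torch.zeros(1, 1)
+model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
+
+def loss_fn():
+    u = model(x)
+    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
+    residual = du - torch.cos(x)
+    return (residual ** 2).mean() + (model(x0) ** 2).mean()
+
+# Stage 1: Adam explores
+adam = torch.optim.Adam(model.parameters(), lr=1e-3)
+for _ in range(500):
+    adam.zero_grad()
+    loss = loss_fn()
+    loss.backward()
+    adam.step()
+loss_adam = loss_fn().item()
+
+# Stage 2: L-BFGS with strong Wolfe line search, warm-started from the Adam weights
+lbfgs = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=100,
+                          max_iter=200, line_search_fn="strong_wolfe")
+
+def closure():
+    lbfgs.zero_grad()
+    loss = loss_fn()
+    loss.backward()
+    return loss
+
+lbfgs.step(closure)
+loss_lbfgs = loss_fn().item()
+print(f"after Adam: {loss_adam:.3e}, after L-BFGS: {loss_lbfgs:.3e}")
+```
+
+Logging both numbers makes it obvious when the second stage has stalled and a restart or a different optimizer is needed.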
+
+**Alternative from NeuralPDE.jl**: Stepwise LR decrease with warm-start:
+```
+Adam(0.1) -> Adam(0.01) -> Adam(0.001)
+# Remake problem with u0 = res.u at each stage
+```
+
+### Key findings on loss landscape
+
+**Near-zero loss required** (Rathore et al., credence ~85%):
+> "A loss of 1e-3 yields L2RE ~ 1e-1, but decreasing loss by 100x to 1e-5 yields L2RE ~ 1e-2."
+> Implication: you need to drive the loss very low for useful accuracy. Don't stop at "loss looks flat."
+
+**L-BFGS stalls but gradient is still useful** (Rathore et al., credence ~80%):
+> "L-BFGS stops without reaching a critical point: gradient norm is ~1e-2 to 1e-3. The gradient still contains useful information."
+> Cause: strong Wolfe line search fails, step size goes to zero.
+> Fix: switch to NNCG (Armijo only) or restart with different LR.
+
+**Condition number grows with nres** (Rathore et al., credence ~70%):
+> Theorem 8.4: condition number = Omega(nres^alpha) where alpha > 1/2.
+> With typical nres = 1e3 to 1e4, condition numbers > 1e4 are expected.
+> Implication: more collocation points doesn't just mean more compute -- it makes the optimization harder.
+
+---
+
+## 4. Loss Design and Gradient Pathologies
+
+### The core problem
+
+The PINN loss has multiple terms (PDE residual, BCs, ICs, data) with different gradient magnitudes. The PDE residual term dominates because it involves differential operators that amplify gradients.
+
+**L1 vs L2 norm choice** (Brunton 2023):
+> L2 norm (MSE) on residuals: default; promotes smooth, low-frequency solutions. L1 norm (MAE) on residuals: more robust to outlier collocation errors and sharp gradients (shocks) since it doesn't square-penalize large pointwise residuals. This is distinct from L1 *regularization on equation coefficients*, which is what SINDy and sparse equation discovery use to promote parsimony (few active terms). Don't conflate the two: L1 residual = robust fitting; L1 coefficient regularization = sparse model selection. For standard PINNs with a known PDE, L2 is correct. L1 residual loss is worth trying if you have shocks or suspect outlier collocation points.
+> Source: Brunton, S. "AI/ML+Physics Part 4 - Crafting a Loss Function." https://www.youtube.com/watch?v=3SNkQ8jhKXc
+
+> Wang et al. 2021 (credence ~80%): "We observe that the gradient of the PDE residual loss is several orders of magnitude larger than the gradient of the boundary/initial condition losses." Demonstrated via histograms of per-parameter gradient magnitudes.
+> Source: https://arxiv.org/abs/2001.04536, Figures 2-3
+
+**Consequences:**
+- BC/IC losses are undertrained (gradient signal drowned out)
+- Solution satisfies the PDE approximately but violates BCs
+- This looks like "the PINN found a wrong solution" but it's really "the optimizer is ignoring the BCs"
+
+### Hessian stiffness
+
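+> Wang et al. 2021: Hessian eigenvalue ratios ~1e5 (max eigenvalue / min eigenvalue). A ratio this large is the signature of an ill-conditioned (stiff) problem: a step size that is stable along the steepest direction is far too small to make progress along the flattest one.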
+> Source: https://arxiv.org/abs/2001.04536, Figures 4-5
+
+### Solutions (in order of preference)
+
+**1. Nondimensionalize first** (see Section 1). This is the cheapest fix and addresses the root cause.
+
+**2. Learning rate annealing** (Wang et al. 2021, credence ~75%):
+> Adaptively weight each loss term inversely proportional to its gradient magnitude. EMA of gradient statistics for stability.
+> Source: https://arxiv.org/abs/2001.04536, Algorithm 1
+> NeuralPDE.jl implements this as `GradientScaleAdaptiveLoss`.
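+
+A minimal sketch of the weight update, assuming the ratio form "max residual gradient over mean term gradient" smoothed by an exponential moving average. The helper names, the `ema` value, and the stand-in losses are illustrative; consult Algorithm 1 of the paper for the exact update the authors use.
+
+```python
+import torch
+
+def grad_vector(loss, params):
+    """Flatten d(loss)/d(params) into one vector without touching .grad."""
+    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
+    return torch.cat([g.reshape(-1) for g in grads if g is not None])
+
+def update_weight(lam_old, loss_res, loss_term, params, ema=0.9):
+    """Return a smoothed weight for loss_term that balances it against loss_res."""
+    g_res = grad_vector(loss_res, params)
+    g_term = grad_vector(loss_term, params)
+    lam_hat = g_res.abs().max() / (g_term.abs().mean() + 1e-12)
+    return ema * lam_old + (1.0 - ema) * lam_hat.item()
+
+torch.manual_seed(0)
+model = torch.nn.Linear(2, 1)
+params = list(model.parameters())
+x = torch.randn(16, 2)
+
+loss_res = (100.0 * model(x)).pow(2).mean()  # stand-in for a large-gradient PDE term
+loss_bc = model(x[:4]).pow(2).mean()         # stand-in for a small-gradient BC term
+
+lam_bc = update_weight(1.0, loss_res, loss_bc, params)
+total_loss = loss_res + lam_bc * loss_bc     # lam_bc is a plain float, so it is not differentiated
+print(f"lam_bc = {lam_bc:.3g}")
+```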
+
+**3. Conflict-free gradient methods** (ConFIG, credence ~70%):
+> Instead of summing loss gradients (which can cancel), project them into a conflict-free direction.
+> ConFIG: unit-normalize per-loss gradients, solve least-squares for combined direction, rescale by projection lengths.
+> Source: https://tum-pbs.github.io/ConFIG/
+> Key: must compute per-loss gradients separately (zero_grad + backward for each). Summing raw losses defeats the purpose.
+> M-ConFIG: momentum variant, updates only one loss's gradient per step. Use with SGD, not Adam (momentum conflict).
+
+**4. Don't use multiple losses if you can avoid it.** A single well-posed loss is usually easier to optimize than a weighted sum. Can you reformulate BCs as hard constraints (e.g., multiply network output by a function that satisfies BCs)? Can you use a penalty method that naturally balances?
+
+**4b. Constrained optimization instead of penalized** (Brunton 2023, credence ~80%):
+> Standard PINNs use penalized (soft) constraints: add physics as a loss term. The alternative is constrained optimization: minimize data error while exactly satisfying the physics constraints. "With a loss function you're not exactly satisfying your constraints. With constrained optimization you are."
+> Source: Brunton, S. "AI/ML+Physics Part 5 - Employing an Optimization Algorithm." https://www.youtube.com/watch?v=T4iJ10TAIMg
+> Physics-informed DMD (Baddoo et al. 2021) is the cleanest example: restrict the DMD matrix to a symmetry-preserving manifold (Hermitian, symplectic, etc.) via the Procrustes problem. KKT closed-form solutions exist because DMD is linear in its parameters -- the constraint is linear in both the output and the parameters simultaneously.
+> Baddoo et al. 2021. "Physics-informed dynamic mode decomposition." Proc. R. Soc. A. https://arxiv.org/abs/2112.04307
+> **Critical caveat for PINNs**: A BC like u(0)=0 is affine in the output u, but it is nonlinear in the NN weights theta. Closed-form KKT does NOT apply to neural network parameters. For NN-based PINNs, the two options for hard constraints are: (a) architectural -- multiply output by a distance function that satisfies the BC (Section 4 item 8), or (b) Augmented Lagrangian Methods (ALM), which are iterative and substantially more complex than Adam. Constrained optimization is most practical for linear models (DMD, SINDy, linear state-space) where the parameters enter linearly.
+
+**5. Curriculum regularization** (Krishnapriyan et al. 2021 NeurIPS, credence ~80%):
+> When the PINN fails on hard PDE regimes (high convection coefficient, strong reaction), don't start there. Start with easy parameters (small coefficient), train to convergence, then warm-start and increase to the target regime. 1-2 orders of magnitude improvement over naive training.
+> "The curriculum training approach achieves significantly better errors, as well as lower variance in the error." (From Figure E.2 showing 10 seeds)
+> Source: https://arxiv.org/abs/2109.01050, Section 5.1 and Figure 4
+> Evidence: docs/evidence/krishnapriyan2021_failure_modes.md
+
+**6. Sequence-to-sequence (time-marching)** (Krishnapriyan et al. 2021, credence ~75%):
+> For time-dependent PDEs: train on a short time window, predict next state, step forward. Don't train on full space-time at once. "Posing the problem as seq2seq learning results in significantly lower error. The difference is particularly striking for reaction and reaction-diffusion cases, where seq2seq decreases error by almost two orders of magnitude."
+> Source: https://arxiv.org/abs/2109.01050, Section 5.2
+> NeuralPDE.jl calls this time-marching; see `WeightedIntervalTraining`.
+> Note: these failures are not due to limited NN expressivity -- the architecture has enough capacity. The problem is optimization difficulty from the soft PDE constraint.
+
+**7. Causal training** (Wang et al. 2022, credence ~85%):
+> Standard PINNs trained by gradient descent are implicitly biased toward minimizing residuals at *later* times before even fitting the initial conditions -- violating physical causality. The NTK analysis shows the residual at time t is influenced more by residuals at later t' > t than earlier ones. This makes PINNs fail on chaotic/turbulent systems.
+> Fix: weight the residual loss at each time step t_i by w_i = exp(-epsilon * sum_{j<i} R_j(theta)), where R_j is the residual loss at time step t_j, so the sum is the residual accumulated over all earlier times. This forces earlier times to converge first before the loss "turns on" at later times.
+> "10-100x improvements in accuracy compared to competing approaches. First time PINNs succeeded on chaotic Lorenz, Kuramoto-Sivashinsky, and 2D Navier-Stokes in turbulent regime."
+> Source: https://arxiv.org/abs/2203.07404, Abstract and Section 3
+> Evidence: docs/evidence/wang2022_causal_training.md
+> Key difference from seq2seq/curriculum: causal weighting works within a single continuous training, without requiring separate time windows or changing the PDE coefficients. Can be combined with seq2seq for further gains.
+> Sensitivity: epsilon controls the steepness of the causal weights. Too small = residuals at later times turn on too early. Too large = training stalls on early time steps. Anneal epsilon during training (the paper steps through an increasing sequence of values).
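+
+The causal weights can be sketched as an exclusive cumulative sum over per-time-step residual losses. The function name and the example residual values are illustrative, and detaching the weights (so they act as constants in the backward pass) is an assumption about the intended usage.
+
+```python
+import torch
+
+def causal_weights(residual_per_time, epsilon):
+    """w_i = exp(-epsilon * sum_{j<i} L_j); the first time step always gets weight 1."""
+    # Exclusive cumulative sum: subtract each entry from the inclusive cumsum
+    accumulated = torch.cumsum(residual_per_time, dim=0) - residual_per_time
+    return torch.exp(-epsilon * accumulated).detach()
+
+# Per-time-step mean squared residuals, ordered from earliest to latest (made-up values)
+L_t = torch.tensor([0.5, 0.2, 0.1, 0.05])
+w = causal_weights(L_t, epsilon=1.0)
+loss = (w * L_t).mean()
+print(w, loss)
+```
+
+As the early residuals shrink during training, the accumulated sum shrinks with them and the later weights rise toward 1.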
+
+**8. Hard BCs via distance functions** (Sukumar & Srivastava 2022 CMAME, credence ~80%):
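+> Instead of penalizing BC violations (soft), multiply the PINN output by a distance function phi(x) that is zero on the boundary. For a homogeneous Dirichlet BC, u(x) = phi(x) * NN(x) satisfies the BC exactly by construction; for nonzero boundary data, use u(x) = g(x) + phi(x) * NN(x), where g is any smooth extension of the boundary values into the domain.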
+> "We eliminate modeling error associated with the satisfaction of boundary conditions. The sole contribution to the loss function is from the residual error at interior collocation points."
+> "The proposed approach consistently outperforms a standard PINN-based collocation method."
+> Source: https://arxiv.org/abs/2104.08426, Abstract and Section 1
+> Evidence: docs/evidence/sukumar2022_exact_bc_distance.md
+> For our heat exchanger: u_{r,tube}(z,r) = r(R-r) * NN(z,r) enforces zero radial velocity at r=0 and r=R by construction. Same principle.
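+
+A minimal sketch of the heat exchanger example above, with the class name, network size, and R = 1 chosen for illustration. The point is that the two wall values are zero for any weights, so no BC loss term is needed for this field.
+
+```python
+import torch
+import torch.nn as nn
+
+class HardBCRadialVelocity(nn.Module):
+    """u_r(z, r) = r * (R - r) * NN(z, r): zero at r=0 and r=R for any weights."""
+
+    def __init__(self, R=1.0, hidden=32):
+        super().__init__()
+        self.R = R
+        self.net = nn.Sequential(
+            nn.Linear(2, hidden), nn.Tanh(),
+            nn.Linear(hidden, hidden), nn.Tanh(),
+            nn.Linear(hidden, 1),
+        )
+
+    def forward(self, z, r):
+        phi = r * (self.R - r)  # vanishes on the axis and on the wall
+        return phi * self.net(torch.cat([z, r], dim=1))
+
+torch.manual_seed(0)
+model = HardBCRadialVelocity(R=1.0).double()
+z = torch.rand(5, 1, dtype=torch.float64)
+
+u_axis = model(z, torch.zeros_like(z))      # r = 0
+u_wall = model(z, torch.ones_like(z))       # r = R
+u_mid = model(z, torch.full_like(z, 0.5))   # interior, generally nonzero
+print(u_axis.abs().max().item(), u_wall.abs().max().item())
+```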
+
+### Known failure modes
+
+**Degenerate U -> 0**: When learning a heat transfer coefficient U jointly with the network, the optimizer may find U -> 0 (no heat transfer, flat profiles). This satisfies the PDE trivially.
+- Fix: bound U in [100, 5000], initialize near a physically reasonable value (~500).
+- Fix: penalize |dT/dz| being too small (the network should predict temperature change).
+
+**Sign errors in counterflow**: The counterflow BVP in z-coordinates has both fluids' enthalpies decreasing in z:
+  - dh_t/dz = -U*A'*(T_t - T_s)/m_dot_t  (tube cools, < 0 since T_t > T_s)
+  - dh_s/dz = -U*A'*(T_t - T_s)/m_dot_s  (same sign! shell also decreases in z, but its flow is -z so it warms)
+  The counterflow is encoded in BCs (h_t at z=0, h_s at z=L), not in opposing ODE signs.
+  Common mistake: putting dh_s/dz = +q gives dh_s/dz > 0, i.e. the shell is coldest at its z=0 exit and hottest at its z=L entry, so it cools while flowing past the hotter tube -- wrong.
+
+**NN vs scalar parameter**: When a scalar (U) and a NN (temperature field) are optimized jointly, the NN is easier to update (more parameters, more gradient signal). The scalar gets stuck. Fix: use different learning rates, or estimate U separately first.
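+
+One way to combine the two fixes (a bounded U and a separate learning rate) is sketched below. The sigmoid reparameterization, the class name `BoundedScalar`, and both learning rates are assumptions for illustration; the bounds and initial value are the ones quoted above.
+
+```python
+import math
+import torch
+import torch.nn as nn
+
+class BoundedScalar(nn.Module):
+    """Scalar kept inside (lo, hi) by a sigmoid, starting at `init`."""
+
+    def __init__(self, lo, hi, init):
+        super().__init__()
+        self.lo, self.hi = lo, hi
+        p = (init - lo) / (hi - lo)
+        # Inverse sigmoid so that forward() returns `init` before any training
+        self.raw = nn.Parameter(torch.tensor(math.log(p / (1.0 - p)), dtype=torch.float64))
+
+    def forward(self):
+        return self.lo + (self.hi - self.lo) * torch.sigmoid(self.raw)
+
+net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1)).double()
+U = BoundedScalar(lo=100.0, hi=5000.0, init=500.0)
+
+# Separate parameter groups: the scalar gets its own (here larger) learning rate
+optimizer = torch.optim.Adam([
+    {"params": net.parameters(), "lr": 1e-3},
+    {"params": U.parameters(), "lr": 1e-2},
+])
+print(U().item())
+```
+
+Because the sigmoid saturates at the bounds, a U that drifts to 100 or 5000 shows up as a vanishing gradient on `raw`, which is worth logging.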
+
+---
+
+## 5. Sampling and Collocation
+
+From NeuralPDE.jl:
+
+| Strategy | When to use |
+|---|---|
+| GridTraining(dx) | Small domains, smooth solutions |
+| StochasticTraining(N) | Default, re-sample each epoch |
+| QuasiRandomTraining(N) | Better coverage (Latin hypercube, Sobol) |
+| QuadratureTraining | Smooth, low-dimensional PDEs |
+| WeightedIntervalTraining | Stiff ODEs, emphasize hard sub-intervals |
+
+**Adaptive sampling** (residual-based): Sample more points where PDE residual is large. Helps with sharp features (phase change, shocks). But adds complexity.
+
+**Smoothness bias**: Monte Carlo approximation of the integral loss naturally biases solutions toward smooth functions (low-frequency modes dominate). This is partly regularization (helpful) and partly harmful (misses sharp features like phase-change fronts, shocks). Mitigations: residual-adaptive sampling, curriculum regularization starting with smooth regime, Fourier feature embeddings to reduce spectral bias.
+
+For 2D problems with radial integrals: use a regular grid in r (including r=0 and r=R) so torch.trapezoid / numpy.trapezoid (numpy.trapz before NumPy 2.0) can compute accurate integrals for conservation checks.
+
+**How many points?** Start with ~1000 collocation points. More helps accuracy but worsens conditioning (Section 3). The sweet spot depends on the problem.
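+
+For readers working in Python instead of NeuralPDE.jl, quasi-random and residual-adaptive sampling can be sketched with `scipy.stats.qmc`. The helper names, the domain bounds, and the stand-in residual (a bump at z = 0.5 mimicking a sharp front) are assumptions for illustration.
+
+```python
+import numpy as np
+from scipy.stats import qmc
+
+def sobol_points(n_log2, lower, upper, seed=0):
+    """Draw 2**n_log2 scrambled Sobol points inside the box [lower, upper]."""
+    sampler = qmc.Sobol(d=len(lower), scramble=True, seed=seed)
+    unit = sampler.random_base2(m=n_log2)
+    return qmc.scale(unit, lower, upper)
+
+def residual_resample(candidates, residual_abs, n_keep, rng):
+    """Keep n_keep candidates, chosen with probability proportional to |residual|."""
+    p = residual_abs / residual_abs.sum()
+    idx = rng.choice(len(candidates), size=n_keep, replace=False, p=p)
+    return candidates[idx]
+
+# Candidate pool over (z, r) in [0, 1] x [0, 1]
+candidates = sobol_points(10, lower=[0.0, 0.0], upper=[1.0, 1.0])
+
+# Stand-in for |PDE residual|: large near z = 0.5, small but nonzero elsewhere
+z = candidates[:, 0]
+residual_abs = np.exp(-((z - 0.5) / 0.05) ** 2) + 1e-3
+
+rng = np.random.default_rng(0)
+selected = residual_resample(candidates, residual_abs, n_keep=128, rng=rng)
+print(candidates.shape, selected.shape)
+```
+
+In a training loop the residual would come from the current network, and the pool would be redrawn every few hundred iterations.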
+
+---
+
+## 6. Property Mappings (EoS for PINNs)
+
+For heat exchangers with phase change, the T(h) mapping from REFPROP/GERG-2008 must be made differentiable.
+
+**Recipe:**
+1. Generate REFPROP data: T vs h at constant P, dense enough near the phase boundary (~200 points over T = [100, 300] K).
+2. Fit a cubic spline to get ~20 minimal knots (reject redundant knots where the curve is nearly linear).
+3. Build PCHIP on those knots for monotonicity guarantee (cubic spline can wiggle; PCHIP cannot).
+4. Wrap for PyTorch autograd:
+   ```python
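+   import torch
+   from scipy.interpolate import PchipInterpolator
+
+   # h_knots, T_knots: the ~20 knots from steps 2-3 (1D arrays, h strictly increasing)
+   pchip_interp = PchipInterpolator(h_knots, T_knots)
+   pchip_deriv = pchip_interp.derivative()  # dT/dh as a piecewise polynomial, built once
+
+   class PchipFunction(torch.autograd.Function):
+       """T(h) evaluated by scipy on CPU, with dT/dh supplied by hand."""
+
+       @staticmethod
+       def forward(ctx, h):
+           ctx.save_for_backward(h)
+           T = pchip_interp(h.detach().cpu().numpy())
+           return torch.as_tensor(T, dtype=h.dtype, device=h.device)
+
+       @staticmethod
+       def backward(ctx, grad_output):
+           h, = ctx.saved_tensors
+           dTdh = pchip_deriv(h.detach().cpu().numpy())
+           # Chain rule; dTdh is a constant here, so only first derivatives flow through
+           return grad_output * torch.as_tensor(dTdh, dtype=h.dtype, device=h.device)
+
+   # Usage: T = PchipFunction.apply(h)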
+   ```
+5. Use float64 throughout. float32 loses precision near the phase boundary where dT/dh is small.
+
+**Why h -> T and not T -> h?** The h(T) direction has a steep region near phase change where dh/dT = Cp -> infinity. The inverse T(h) is smooth and monotone everywhere. Always solve in h space, convert to T only for losses and visualization.
+
+---
+
+## 7. Initial Conditions and Multi-Episode Training
+
+IC is a loss term, but in practice ICs from real plant data are noisy and uncertain.
+
+**Options (in order of complexity):**
+1. Hard-code IC from first observed data point. Simplest; biased if sensor lags.
+2. Learnable IC: optimize IC scalars jointly with the network. Adds ~n_states parameters. Works well with L-BFGS polish.
+3. Soft IC loss: penalize `||u(t=0) - u_obs(0)||^2` as one loss term. Lets optimizer trade off IC vs physics fit.
+
+**Multi-episode training** (e.g., ~10 cold restarts over a year):
+- Each episode needs its own IC.
+- Common pattern: learn per-episode IC vector `ic_k` while sharing the physics network across episodes.
+- Warm-start trick: if you have a pure-data pre-fit (Level 3 baseline), initialize the PINN's network weights from that fit.
+
+---
+
+## 8. Validation and General ML Debugging
+
+### PINN-specific checks
+
+1. **Energy conservation**: |Q_in - Q_out| / Q_total < 1%. For 2D: check per axial slice.
+2. **Residual magnitude**: PDE residual should be near zero everywhere, not just on average. Plot residual vs (z, r).
+3. **BC satisfaction**: Check that BCs are actually satisfied. The BC loss being small doesn't mean the BC is satisfied if the residual loss dominates the gradient.
+4. **Physical reasonableness**: Does the temperature profile make physical sense? Are T_tube and T_shell monotone in z with the expected sign (both decreasing in z under the Section 4 counterflow convention, where the hot tube stream flows in +z)? Are there unphysical oscillations?
+5. **Compare to classical solver**: Solve the same problem with scipy solve_bvp. If the PINN disagrees, the PINN is probably wrong.
+
+### General ML debugging (from ml_debug folklore)
+
+These apply to PINNs too:
+
+**Work in order, don't skip to hyperparameters:**
+1. Verify components in isolation (forward pass shapes, loss computation by hand, data pipeline)
+2. Get signs of life on a toy problem (1D, known solution, constant Cp)
+3. Overfit to training data first. If you can't overfit, you can't generalize.
+4. Log everything: losses per component, gradient norms per module, parameter norms, activation stats
+5. Numerical hygiene: `assert torch.isfinite(loss)`, `log(x.clamp(min=1e-8))`, `x / (std + 1e-5)`
+
+**Symptom table** (adapted for PINNs):
+
+| Symptom | Likely cause |
+|---|---|
+| Loss stuck from the start | LR too low, bad init, wrong loss function, nondimensionalization missing |
+| Loss decreases then explodes | LR too high, numerical instability in T(h) near phase boundary |
+| Loss NaN | log(0), 0/0 in property mapping, temperature outside EoS range |
+| Physics loss low but BCs violated | Gradient imbalance (Section 4), need loss weighting |
+| U converges to 0 or bound | Degenerate solution (Section 4), check sign convention |
+| Model outputs constant | Dead neurons, vanishing gradients, U -> 0 |
+| Good train, bad test RMSE | Overfitting to collocation points, not enough physics constraint |
+| PINN worse than pure-data MLP | Wrong equations, bad scaling, or physics constraint fighting the data |
+
+**Gradient clipping masks problems.** Always log the pre-clip norm. If it's constantly triggered, fix the root cause (nondimensionalize, reduce LR, check for numerical issues).
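+
+In PyTorch, `torch.nn.utils.clip_grad_norm_` returns the total gradient norm measured before clipping, so logging it costs nothing. The toy model and the deliberately oversized loss below are illustrative.
+
+```python
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+model = nn.Linear(4, 1)
+x = torch.randn(32, 4)
+
+# Deliberately badly scaled loss so that clipping triggers
+loss = (1e3 * model(x)).pow(2).mean()
+loss.backward()
+
+max_norm = 1.0
+# The return value is the norm BEFORE clipping; log this, not the clipped value
+pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
+post_clip_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
+
+clip_triggered = pre_clip_norm.item() > max_norm
+print(f"pre-clip {pre_clip_norm.item():.3e}, post-clip {post_clip_norm.item():.3e}, "
+      f"triggered={clip_triggered}")
+```
+
+Tracking the fraction of steps where `clip_triggered` is true gives a direct measure of how much the clip is hiding.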
+
+**Assume you have a bug.** Most of the time when a PINN doesn't work, it's a bug (sign error, wrong BC, missing factor of 2pi), not a hyperparameter issue.
+
+### Loss surface analysis (from ml_debug)
+
+When a loss term isn't working, don't guess -- visualize:
+1. Grid over 1-2 key axes (e.g., U and a temperature), compute loss + gradient at each point.
+2. Plot contour + quiver (negative gradient = optimization direction).
+3. Look for: dead zones (zero gradient), saddle points, competing gradients, scale imbalance.
+
+This takes 5 minutes and saves hours.
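+
+The three steps above can be sketched with NumPy alone. The loss here is a made-up two-parameter fit (steady-state temperature and time constant of an exponential decay), and finite differences from `np.gradient` stand in for autograd gradients; plotting is left as comments since it needs matplotlib.
+
+```python
+import numpy as np
+
+# Synthetic observations from a known first-order response (illustrative values)
+t = np.linspace(0.0, 10.0, 50)
+y_obs = 300.0 + 50.0 * np.exp(-t / 2.0)
+
+def loss(T_ss, tau):
+    pred = T_ss + 50.0 * np.exp(-t / tau)
+    return np.mean((pred - y_obs) ** 2)
+
+# Step 1: grid over the two axes of interest
+T_ss_axis = np.linspace(290.0, 310.0, 41)
+tau_axis = np.linspace(0.5, 5.0, 46)
+grid = np.array([[loss(a, b) for b in tau_axis] for a in T_ss_axis])
+
+# Gradient field by finite differences along each axis
+dL_dTss, dL_dtau = np.gradient(grid, T_ss_axis, tau_axis)
+
+# Step 3: quick numeric checks before plotting
+i, j = np.unravel_index(np.argmin(grid), grid.shape)
+scale_ratio = np.abs(dL_dtau).mean() / np.abs(dL_dTss).mean()
+print(f"grid minimum at T_ss={T_ss_axis[i]:.1f}, tau={tau_axis[j]:.1f}, "
+      f"mean |dL/dtau| / mean |dL/dT_ss| = {scale_ratio:.3g}")
+
+# Step 2 (optional, requires matplotlib):
+# plt.contourf(tau_axis, T_ss_axis, np.log10(grid + 1e-12))
+# plt.quiver(tau_axis, T_ss_axis, -dL_dtau, -dL_dTss)
+```
+
+A `scale_ratio` far from 1 is the scale imbalance mentioned in step 3, seen directly in the gradient field.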
+
+---
+
+## 9. Multi-Loss Training Details (ConFIG)
+
+If you must use multiple loss terms (and in PINNs you usually must):
+
+### ConFIG (recommended over naive summation, credence ~70%)
+
+```python
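+# Import paths follow the conflictfree package docs; check them against your installed version
+from conflictfree.grad_operator import ConFIG_update
+from conflictfree.utils import get_gradient_vector, apply_gradient_vector
+
+# Per-loss gradient capture (ConFIG requires this)
+grads = []  # reset at the start of every training step
+for loss_fn in [L_pde, L_bc, L_data]:
+    optimizer.zero_grad()
+    loss = loss_fn(model, collocation_pts)
+    loss.backward()
+    grads.append(get_gradient_vector(model))
+
+# Combine via conflict-free direction
+combined_grad = ConFIG_update(grads, use_least_square=True)
+apply_gradient_vector(model, combined_grad)  # writes the combined direction into each .grad
+optimizer.step()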
+```
+
+**Key rules:**
+- Don't pass summed loss into ConFIG. That defeats the purpose.
+- Use `lstsq` not `pinv` (more stable).
+- If some parameters don't participate in a loss, use `none_grad_mode='zero'`.
+- M-ConFIG (momentum variant): use SGD, NOT Adam. Momentum conflict.
+
+### Alternatives
+
+| Method | Pros | Cons |
+|---|---|---|
+| Naive sum | Simple | Gradient conflict, dominant terms drown others |
+| GradientScaleAdaptiveLoss | Built into NeuralPDE.jl | Heuristic, EMA lag |
+| ReLoBRaLo | Effective on benchmarks | More complex |
+| ConFIG | Theoretically grounded, consistent wins | Per-loss backward required (2-3x cost) |
+| Hard constraints | Eliminates BC loss entirely | Not always possible |
+
+> ConFIG authors claim superiority over PCGrad and Adam baseline on Burgers, Schrodinger, Kovasznay, Beltrami. UPGrad mentions ConFIG but ConFIG does not cite UPGrad.
+> Source: https://tum-pbs.github.io/ConFIG/
+
+---
+
+## 10. Domain Decomposition (XPINNs)
+
+For large or geometrically complex domains, split into subdomains each with a local PINN. Interface conditions enforce continuity between subdomains.
+
+> Jagtap et al. 2020. "Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations." Commun. Comput. Phys. https://arxiv.org/abs/2005.11025
+> Credence ~70%: Multiple citations, implemented in DeepXDE. Enables parallelization; each subdomain network is smaller and easier to optimize.
+> Key: interface residuals must be added as additional loss terms. Continuity of u and its normal derivative across interfaces.
+> Relevant for our heat exchanger: could split axially (cold end / hot end) if the solution has different character in different regions.
+
+**Physics reparameterization**: Instead of learning u(x,t) from scratch, solve for the deviation from a known approximate solution: u(x,t) = u_approx(x,t) + delta(x,t), where the NN learns delta. If u_approx is good, delta is small and the landscape is smoother. The approximate solution can be the steady-state solution, a linearization, or a coarse numerical solution.
+
+## 11. PIKANs (Kolmogorov-Arnold Networks for PINNs)
+
+> Toscano et al. 2024: PIKANs "lead to smaller models and may also contribute to lowering computational cost while maintaining good accuracy."
+> Source: https://arxiv.org/abs/2410.13228
+> Credence ~40%: New, no independent replication. Other authors focus on improving PINNs within the MLP framework, not validating PIKANs as an alternative. Interesting but unproven.
+
+---
+
+## References
+
+### Comprehensive guides (start here)
+- Wang et al. 2023. "An Expert's Guide to Training Physics-Informed Neural Networks." arXiv:2308.08468. https://arxiv.org/abs/2308.08468
+  - Key: most thorough practical guide. Covers architecture (modified MLP), sampling, loss weighting, Fourier features, causal training, code. By the same group as the 2021 paper.
+- Wang et al. 2021. "Understanding and mitigating gradient pathologies in physics-informed neural networks." SIAM J. Sci. Comput. https://arxiv.org/abs/2001.04536
+  - Key: gradient imbalance diagnosis, learning rate annealing, modified MLP architecture
+
+### Loss landscape and optimization
+- Rathore et al. 2024. "Challenges in Training PINNs: A Loss Landscape Perspective." ICML. https://arxiv.org/abs/2402.01868
+  - Key: Adam+L-BFGS, ill-conditioning from differential operators, near-zero loss required
+- Krishnapriyan et al. 2021. "Characterizing possible failure modes in physics-informed neural networks." NeurIPS. https://arxiv.org/abs/2109.01050
+  - Key: curriculum regularization, seq2seq, failure is optimization not expressivity
+
+### Training strategies
+- Wang et al. 2022. "Respecting causality for training physics-informed neural networks." CMAME (published 2024). https://arxiv.org/abs/2203.07404
+  - Key: causal weighting of temporal residuals, 10-100x improvement on chaotic systems, first PINN success on turbulence
+- Jagtap et al. 2020. "Extended Physics-Informed Neural Networks (XPINNs)." Commun. Comput. Phys. https://arxiv.org/abs/2005.11025
+  - Key: domain decomposition, parallelizable, implemented in DeepXDE
+
+### Hard constraints and boundary conditions
+- Sukumar & Srivastava 2022. "Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks." CMAME. https://arxiv.org/abs/2104.08426
+  - Key: distance-function trial functions, eliminates BC loss, consistently outperforms soft BCs
+- Lagaris et al. 1998. "Artificial neural networks for solving ordinary and partial differential equations." IEEE Trans. Neural Netw. doi:10.1109/72.712178
+  - Key: original paper on hard BCs via trial functions. Sukumar 2022 is the modern extension.
+
+### Architecture and parameterization
+- Wang et al. 2022. "Random Weight Factorization Improves the Training of Continuous Neural Representations." https://arxiv.org/abs/2210.01274
+  - Key: factorize w = s * w_unit, better local minima, used in PirateNet
+- Toscano et al. 2024. "From PINNs to PIKANs." https://arxiv.org/abs/2410.13228
+- PirateNet / jaxpi: https://github.com/PredictiveIntelligenceLab/jaxpi (bundles RWF + causal + seq2seq + Fourier)
+- Ling et al. 2016. "Machine learning strategies for systems with invariance properties." J. Comput. Phys. https://doi.org/10.1016/j.jcp.2016.05.003
+  - Key: tensor-layer architecture enforcing Galilean invariance by construction for turbulence closure; symmetry-via-architecture beats symmetry-via-augmentation
+
+### Constrained optimization and physics-informed DMD
+- Baddoo et al. 2021. "Physics-informed dynamic mode decomposition." Proc. R. Soc. A. https://arxiv.org/abs/2112.04307
+  - Key: restrict DMD to symmetry-preserving matrix manifolds (Hermitian, symplectic) via Procrustes problem; exactly satisfies conservation laws without penalty terms
+
+### Lecture series
+- Brunton, S. 2023. "AI/ML + Physics" YouTube lecture series. University of Washington.
+  - Part 1 (problem definition): https://www.youtube.com/watch?v=ARMk955pGbg
+  - Part 2 (training data): https://www.youtube.com/watch?v=g-S0m2zcKUg
+  - Part 3 (architecture): https://www.youtube.com/watch?v=fiX8c-4K0-Q
+  - Part 4 (loss function): https://www.youtube.com/watch?v=3SNkQ8jhKXc
+  - Part 5 (optimization): https://www.youtube.com/watch?v=T4iJ10TAIMg
+
+### Alternative formulations
+- Weinan E & Bing Yu 2018. "The Deep Ritz Method: A Deep Learning-Based Numerical Algorithm for Solving Variational Problems." Commun. Math. Stat. https://arxiv.org/abs/1710.00211
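+  - Key: energy minimization formulation instead of strong-form residuals. Better-conditioned for elliptic PDEs, can be easier to optimize since the loss is an energy functional that needs only first derivatives of u (the strong form needs second). Note the energy is bounded below but not necessarily positive.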
+
+### Multi-loss training
+- Liu et al. 2024. "ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks." https://tum-pbs.github.io/ConFIG/
+- Bischof & Kraus. "ReLoBRaLo: Relative Loss Balancing with Random Lookback."
+- TorchJD (UPGrad + aggregation methods): https://torchjd.org/stable/docs/aggregation/
+
+### General ML debugging
+- Schulman 2017. "Nuts and Bolts of Deep RL Experimentation." http://joschu.net/docs/nuts-and-bolts.pdf
+- Andy Jones 2021. "Debugging RL, Without the Agonizing Pain." https://andyljones.com/posts/rl-debugging.html
+
+### NeuralPDE.jl
+- Architecture, sampling, adaptive loss patterns from tests and docs.
+- https://github.com/SciML/NeuralPDE.jl
diff --git a/pinn/docs/evidence/PINN youtube transcripts.md b/pinn/docs/evidence/PINN youtube transcripts.md
new file mode 100644
index 0000000..5d0350a
--- /dev/null
+++ b/pinn/docs/evidence/PINN youtube transcripts.md	
@@ -0,0 +1,1603 @@
+# Physics Informed Neural Networks (PINNs) [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/-zrY7P2dVC4/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=-zrY7P2dVC4&pp=ygUgcGh5c2ljcyBpbmZvcm1lZCBuZXVyYWwgbmV0d29ya3M%3D)
+
+> welcome back I'm Steve Brunton from the
+> University of Washington and today I'm
+> really excited to start a new lecture
+> series which is a part of a larger uh
+> set of boot camps and short courses on
+> the topic of physics informed machine
+> learning so this is one of the most
+> important topics in all of machine
+> learning because it's really at the
+> intersection of how do you leverage you
+> know existing physical knowledge to
+> improve machine learning systems and how
+> do you use machine learning to learn new
+> physics that humans have never written
+> down before okay so um you know the idea
+> of machine learning we talk about this
+> all the time we see the success stories
+> in the news in our daily lives and the
+> pace of progress is really you know
+> astounding in machine learning things
+> that were once only you know uh science
+> fiction are now science fact so this is
+> you know the Mona Lisa coming to life
+> from a single still image with machine
+> learning we also have you know uh
+> rapidly advancing generative AI so this
+> is an example of DALL-E 2 where you can
+> put in a prompt like guitar monster and
+> it just generates this beautiful image
+> of a monster playing a guitar notice
+> that the guitar is also a monster which
+> I think is kind of cool and although
+> these are absolutely useful for things
+> like you know creating new art and
+> graphics and generating text and translating
+> languages all kinds of um you know
+> classic machine learning tasks there is
+> this growing world of how do we use
+> physics informed machine learning for
+> things like designing an entirely new
+> aircraft or a new super material that's
+> going to be used in the engine or in the
+> the wing or the fuselage of this
+> aircraft or any of the other um you know
+> engineering and physics tasks that we as
+> humans have things like you know
+> predicting and modeling and
+> understanding fluid uh flows and
+> turbulence which are again important for
+> Designing you know uh aircraft race cars
+> transport ships as well as wind turbines
+> and you know understanding climate
+> change and weather and things like that
+> things like Robotics and digital twins
+> uh for
+> autonomy uh and this is just a tiny tiny
+> glimpse of the many many applications of
+> machine learning and physics informed
+> machine learning in particular
+> in engineering and Natural Sciences so
+> if you you know want to be a part of
+> this uh Revolution if you want to change
+> the world if you want to design new
+> super materials that are going to allow
+> us to build better aircraft and better
+> wind turbines and um you know better
+> algorithms for robots and understanding
+> things like climate change physics
+> informed machine learning is going to become
+> an essential tool in Your Arsenal in the
+> future so that's what this lecture
+> series is about and this is the overview
+> lecture uh in that
+> series okay so zooming out we've talked
+> about this a lot before that machine
+> learning is essentially the process of
+> building models from data using
+> optimization and regression techniques
+> now we've been building models from data
+> with optimization and regression for
+> decades if not centuries if not
+> Millennia as humans so some of our
+> earliest models of the planets were
+> essentially models from observational
+> data that we did some kind of a crude
+> regression uh or optimization to kind of
+> understand those
+> models but today we have much more data
+> it's much more we have much more
+> advanced optimization algorithms and so
+> that's one of the reasons machine
+> learning has been advancing so
+> rapidly now that's classic machine
+> learning what we're interested in
+> talking about today and in this series
+> is physics informed machine learning
+> which is how do we build physical models
+> from data using those same optimization
+> and regression techniques or at least
+> building on those existing optimization
+> uh and regression
+> Frameworks and so this entire series is
+> focused on this intersection of
+> artificial intelligence and machine
+> learning with physics okay I think too
+> often researchers in the machine
+> learning field kind of forget that we
+> actually have uh centuries of knowledge
+> of physics we have tons of tons of
+> knowledge of how physical systems
+> actually work and often times we throw
+> that out when we build these machine
+> learning models but in the last five or
+> 10 years it started to become more clear
+> that we actually need to incorporate
+> physics into the machine learning
+> process to get the advanced performance
+> we need on really complicated systems
+> like in engineering uh or in Natural
+> Sciences and you know biology things
+> like that and so there's kind of two big
+> aspects that are going to going to come
+> up over and over again two big themes in
+> this series one is we can enforce
+> physics into machine learning models we
+> essentially can take known physics or
+> partially known physics things like
+> symmetries conservation laws invariances
+> things like that and we can bake those
+> in to our machine learning models to
+> make them much more performant so we can
+> learn better models that generalize
+> better with less training data if we
+> enforce physics in to the models and the
+> flip side of that is we can often and
+> increasingly are are able to use machine
+> learning techniques to discover entirely
+> new physics with our measurement data so
+> if I have measurement data from you know
+> galactic motion or a plasma or you know
+> some biological system I might be able
+> to use these machine learning methods uh
+> that are being developed in other fields
+> of computer science and I might be able
+> to use those methods to discover
+> new physical models that we've never
+> been able to write down before that
+> humans haven't been able to discover uh
+> because of either their complexity or or
+> some other aspect um you know some
+> complexity of the data and so that's a
+> really exciting aspect too we can start
+> getting things like ordinary and partial
+> differential equations that describe uh
+> really really complex systems that
+> humans haven't been able to describe
+> using classical methods okay so those
+> are the two big themes we're going to
+> enforce or bake in physics into the
+> machine learning process and
+> simultaneously often times we're going
+> to use machine learning to discover new
+> physics and we're going to find that
+> these are kind of dual uh
+> problems and here's a really really
+> simple example I like to show just to
+> get us warmed up what do we mean by
+> physics informed machine learning
+> this is a pendulum uh mounted on the
+> wall in my lab and so it's just moving
+> back and forth there's a little uh
+> accelerometer on the tip and a little
+> LED so you can take a video of this
+> and kind of traditional machine learning
+> might take a video of this pendulum and
+> try to use an autoencoder neural network
+> to uh reduce the dimension of this video
+> into a minimal latent State given by
+> this red Vector here such that you can
+> use that minimal State and then recreate
+> the video from those very few variables
+> okay we know that this is a low
+> dimensional system we think that this is
+> probably described by you know a
+> variable Theta the angle maybe its
+> velocity kind of naive machine learning
+> would just you know try to find some
+> coordinates that you can compress this
+> data down to so that you could also then
+> expand it back into the full video
+> whereas physics informed machine learning
+> is trying to do something a little bit
+> more sophisticated we're not just trying
+> to learn the coordinates uh to compress
+> this video we're also trying to learn
+> maybe a differential equation for how
+> those coordinates evolve in time
+> something like this equation we know
+> from the the damped pendulum equation
+> and again you can think of this in one
+> of two ways I could either if I knew
+> that this was the equation I could try
+> to bake that into this latent State uh
+> in these these red coordinates here to
+> try to learn the right coordinates um
+> with my machine learning model if I
+> baked in this known physics I might be
+> able to do a better job or alternatively
+> I could try to learn this differential
+> equation purely from this measurement
+> data now again zooming into this example
+> we know that there's kind of two key
+> ingredients if we were going to do this
+> using textbook methods the first would
+> be I need to learn a good coordinate
+> system Theta in which I can represent
+> this uh pendulum motion Theta is a lot
+> better than the X and Y position of the
+> tip for example and then once I learn
+> that coordinate I want to learn some
+> equations of motion some differential
+> equation that describes how the system
+> evolves in time so that's kind of what
+> we mean by physics informed machine
+> learning often times that means learning
+> the right uh coordinates and
+> learning some governing equations or
+> some physical laws which often times are
+> things like differential equations that
+> describe the motion in time okay and
+> again we can either learn that from our
+> data or we can bake some of these
+> elements into our machine learning
+> models
+> good okay so we're going to zoom out and
+> we're going to be talking about
+> specifically the stages that go into
+> building a classic machine learning
+> model so I'm going to break this down
+> into five key stages that we typically
+> use to build any machine learning model
+> and then I'm going to show you how we
+> can bake physics into each of those five
+> stages now this is not 100% accurate I I
+> say you know this is 80 90% accurate
+> tops but this is a good kind of
+> guideline for how the process of machine
+> Learning Works that we can use to kind
+> of organize this this lecture Series so
+> the first of five stages in machine
+> learning is essentially deciding on a
+> problem what are we going to model with
+> our machine learning model okay so that
+> means deciding on what are the inputs
+> and the outputs what relationship am I
+> trying to model um so it could be you
+> know I have pictures of dogs and cats
+> and I want to build a classifier to
+> label them as dogs or cats we've decided
+> on a problem of what we're going to
+> model OKAY stage two once I've done that
+> I need to gather and curate data
+> training data that I'm going to use to
+> train that model okay so what data will
+> inform that model and how do I gather
+> that data so again in that dogs and cats
+> example I need a bunch of pictures of
+> dogs and cats and I have to decide do I
+> actually have the labels do I know which
+> images are dogs and which ones are cats
+> you know that that's my my data curation
+> often one of the most expensive
+> processes uh in this pipeline sometimes
+> this can be you know millions or tens of
+> millions of dollars to get that data
+> set stage three and this is where a lot
+> of people get excited uh myself included
+> is stage three you get to design some
+> kind of an architecture so we know what
+> problem we're trying to model we know
+> the inputs and the outputs uh that we
+> want to be kind of learning the
+> relationship between stage three is
+> where we choose an architecture like a
+> you know special type of neural network
+> or a SINDy model or an autoencoder
+> whatever type of architecture you choose
+> that because you think that it's going
+> to do a good job of representing that
+> functional relationship between the
+> inputs and outputs and so a lot of work
+> goes into designing custom architectures
+> that are good for specific types of
+> problems we know that if we want to do
+> image classification we use
+> convolutional neural networks if we want
+> to model time series we use recurrent
+> neural networks and things like that so
+> there's a lot of kind of knowhow in
+> choosing the architecture in fact this
+> is an architecture of an autoencoder
+> that is used for compressing the
+> dimension of an input down before
+> expanding it back out okay stage four so
+> now we know the problem we have our data
+> and we have a candidate architecture we
+> think we might be able to train to model
+> that input output relationship stage
+> four is crafting a loss function some
+> kind of an objective function that tells
+> you if your models did a good job so
+> usually this includes terms like the
+> actual error of the input output um
+> prediction of your neural network or
+> whatever your machine learning
+> architecture is with the actual data you
+> use to train it and there's all kinds of
+> other loss functions you can add you can
+> add regularization terms to make your
+> model more sparse or more smooth or
+> things like that so again uh steps three
+> and four are where a lot of the modern
+> research uh in machine learning and in
+> physics informed machine
+> learning uh exists today okay a lot a
+> lot of research in stages three and
+> four and step five once we have our
+> architecture and our loss function we
+> essentially pick some optimization
+> algorithm to train the model okay and by
+> training the model what we mean is we
+> use optimization to tweak the parameters
+> of our architecture to tune and learn
+> the parameters of this architecture to
+> minimize our loss function averaged over
+> our data this is such an important point
+> I'm going to say this again okay so our
+> architecture is some class of functions
+> that you can use to represent the input
+> output relationships you want to to
+> model okay so there are essentially
+> parameters of this architecture you can
+> tune in the case of a neural network
+> it's literally the weights of all of the
+> connections in that neural network you
+> get to tune those weights to learn some
+> kind of input output function and so we
+> use optimization like the Adam optimizer
+> and many other types of you know
+> stochastic gradient descent
+> optimization
+> to to tune those parameters of your
+> architecture to minimize your loss
+> function averaged over your training
+> data and that's really the entire
+> process of machine learning okay we
+> decide on a problem we collect our data
+> we pick a class of functions we think
+> can represent that input output
+> relationship we call that an
+> architecture like a specific kind of
+> neural network we design or craft a loss
+> function that tells us if we did a good
+> job and then we use optimization to
+> tweak the parameters of our architecture
+> to learn those parameters that minimize
+> the loss function averaged over our
+> training
+> data and the reason I lay it out like
+> this in these five stages is because
+> each of these five stages gives unique
+> opportunities for embedding physics into
+> the process and in some cases for
+> discovering physics so if we have
+> partial knowledge of the physics if I
+> know that I am trying to model you know
+> the lift over an airfoil or some
+> material property or some robotic arm
+> with you know joints and linkages then
+> at each of these stages each of these
+> five stages I can embed knowledge of the
+> physics and improve this machine
+> learning process I can constrain the
+> architecture I can add custom loss
+> functions there are certain types of
+> optimizations I can use that will
+> respect this physics more than others
+> and so all of these stages give us
+> unique and Powerful ways of adding
+> physics into the
+> process so we're going to use this
+> diagram to organize the almost this
+> entire kind of uh short you know module
+> or boot camp in this larger lecture
+> Series so we're going to zoom into each
+> of these five examples and we're going
+> to look at how uh how to embed physics
+> and extract physics in each of these
+> stages of machine
+> learning so um if you want to learn more
+> about this uh some of this is in a
+> textbook written by Nathan Kutz and
+> myself Data-Driven Science and
+> Engineering you can download a free PDF
+> uh of it here and I'm hoping that you
+> know around the time that this video is
+> released there will be other resources
+> you know notes on this and things like
+> that uh so check the um video
+> description for more links to resources
+> um that that you can
+> use now again we think a lot about
+> architectures that's one of the kind of
+> exciting things that gets people you
+> know a lot of people get excited by all
+> of the cool architectures you can try on
+> a new uh machine learning task
+> especially in physics informed
+> machine
+> learning and there is a zoo of possible
+> you know architectures just for neural
+> networks and I want to make an important
+> Point not all machine learning is just
+> neural networks there's a lot of other
+> architectures that are not neural
+> networks that we can use um to learn
+> models from data but even within the
+> realm of just looking at neural networks
+> there is this kind of menagerie this
+> this zoo of neural networks and a lot of
+> times what we do is kind of akin to uh
+> alchemy so kind of modern day alchemy
+> where you have a problem and through
+> some combination of intuition maybe
+> reading Reddit posts or seeing what your
+> friend did we have some idea of what
+> types of neural network architectures
+> might or might not work for our problem
+> and then we often end up just trying a
+> bunch of them and sometimes they work
+> and sometimes they don't sometimes you
+> can turn lead into gold metaphorically
+> speaking and sometimes you can't and so
+> one of the big pushes of applying
+> machine learning to physical systems
+> where we actually do have some prior
+> knowledge we know a lot about these
+> systems is that maybe we can start also
+> learning principles of how and when
+> different neural network architectures
+> actually are appropriate for a given
+> problem so this is a really really
+> important point you know obviously we
+> want to be doing better engineering and
+> better science and we want to use the
+> best tools at our disposal which
+> nowadays you know machine learning is in
+> that tool set but there is another
+> perspective which is by applying
+> machine learning to physical problems
+> where we sometimes know the answer we
+> can learn a lot more about these um
+> machine learning algorithms and models
+> than we would if we just apply it to a
+> system where we don't know the answer
+> things like you know trying to cure
+> cancer okay so if you apply this to
+> systems where you know the answer we
+> might be able to go from this Alchemy
+> stage of machine learning to something
+> that more closely resembles chemistry
+> where there's actual principles and
+> organization to when and how to use
+> these building blocks for a particular
+> problem okay um that's something we are
+> very very you know interested in doing
+> as a community there's a ton of effort
+> uh kind of trying to learn these basic
+> principles of machine learning by
+> applying them to physics systems systems
+> where we know the answer and Engineering
+> Systems and this is actually one of the
+> main thrusts of this NSF funded AI
+> Institute in Dynamic
+> systems um you can kind of go to the
+> website and Google it or and check it
+> out but there is this NSF funded AI
+> Institute in Dynamic systems which is
+> really focused on trying to understand
+> how can you embed physics into machine
+> Learning Systems how can you use machine
+> Learning Systems to learn new physics to
+> solve engineering tasks in these you
+> know various application domains and to
+> go from that kind of alchemy uh model of
+> just you know guess and check and trial
+> and error to something much more
+> principled okay and so this is a
+> collaboration with many many people and
+> it's actually bigger than just this NSF
+> Institute um trying to learn these
+> principles of machine learning okay so
+> really exciting uh field and topic and
+> area and it's a really great time to be
+> a researcher uh stepping into this field
+> because it's you know there's a ton of
+> open problems unsolved challenges and
+> you know many many many uh billion dollar
+> industries can be transformed with these
+> techniques uh in the coming
+> years good okay so uh I'm just going to
+> zoom into these five uh kind of areas of
+> machine learning and talk about examples
+> of how we bake physics in um this is
+> just going to be a very very fast
+> high-level overview I'm going to have
+> entire videos going into each of these
+> sub problems so we're going to have a
+> whole video talking about how you can
+> bake physics into this stage one and
+> stage two and so on and so forth but
+> let's just kind of zoom in at a very
+> high level so the first stage of machine
+> learning is deciding on the problem
+> sometimes this doesn't even go uh it
+> goes unstated you know people take this
+> for granted but this is an important
+> part of the machine learning process is
+> picking what you actually think you can
+> model with your data what you want to
+> model um what would be a useful model
+> why do you want to model those are all
+> important questions and fundamentally at
+> the end of the day if you're modeling a
+> physical system like you know a lava
+> lamp or a pendulum or a fluid flow if
+> you're modeling um you know a materials
+> process or or something that is actually
+> physical then you're kind of already
+> doing this this you know physics based
+> machine learning it physics is already
+> in the process when you decide that
+> you're trying to model some physical
+> process so for example if I'm going to
+> you know in my machine learning model
+> try to learn the forces on this pendulum
+> you know so we know f equals ma if I try
+> to learn those forces with a machine
+> learning model I'm already baking
+> physics into the process through the
+> problem statement similarly I could try
+> to learn you know a Hamiltonian or a
+> Lagrangian I could try to learn uh a free
+> energy potential or um you know lots and
+> lots of different ways you can you can
+> bake in physics just by setting up the
+> problem in a clever
+> way okay stage two curating the data so
+> what data am I going to use again if the
+> data comes from a physical system
+> inherently you are kind of embedding
+> physics through um that that data
+> collection and sometimes I joke that
+> stage two is kind of the Google way of
+> baking physics into the model because if
+> you the idea is if you collect enough
+> data of the natural world it kind of has
+> to learn physics like F = ma and maybe
+> even eventually things like E = mc^2
+> just to reconcile all of that data that
+> it's collecting but the reason I say
+> that that's like the Google approach is
+> because it's very very expensive that
+> relies on having you know nearly
+> infinite amounts of data which often as
+> scientists and Engineers we don't have
+> that luxury so we much much more often
+> have targeted limited kind of narrow
+> data sets that we have to use to learn
+> the physics that extrapolates more
+> broadly okay um and this is kind of
+> subtle again we're going to have a whole
+> video going into this but I want to
+> briefly mention it so you know for
+> example am I going to take fluid flow
+> Fields as my training data and I'm I am
+> I going to have them at different flow
+> velocities you know that's a very
+> different model if it only has one fluid
+> flow velocity or it has a range of
+> velocities that will change kind of what
+> physics I'm capturing in my model
+> similarly you know if I'd have uh fluid
+> flow geophysical fluid flow this is
+> actually cloud formation uh past
+> Guadalupe Island if I have this kind of
+> data one of the things I can do in my
+> data curation process to uh add physics
+> is if I think that the physics doesn't
+> actually depend on you know the angle of
+> rotation or translation if I think that
+> my physics is invariant to some kind of
+> symmetry like translation or rotation I
+> can augment my data with additional
+> copies that are transformed according to
+> that in this case you know rotations in
+> other cases translations that's again a
+> very very common thing people do is that
+> if you know that your system has some
+> symmetry or invariance which is another
+> form of physics you can augment that
+> data to kind of have uh to transform
+> your data to have those symmetries and
+> invariance so for example if I'm
+> building a classifier that can tell you
+> know a Prius from a Ford pickup truck
+> and maybe you know other cars and trucks
+> I'm going to take my images of cars and
+> trucks and I might rotate them and
+> translate them and scale them because
+> none of that should matter for the
+> classification so I can take my data and
+> augment it to include those
+> symmetries uh and invariances that we
+> think our model should be uh invariant
+> to good um and again one of the things
+> I'm going to talk about a lot and I've
+> talked about this in previous videos and
+> we're going to zoom into it here is the
+> coordinates typically matter a lot when
+> we're doing uh machine learning on
+> physical systems so this is a cartoon uh
+> from Malin Christersson that I really like I
+> use this one a lot that shows these kind
+> of two different coordinate systems
+> for the planets and the Sun and
+> the Earth the one on the right this is
+> the geocentric view that is kind of what
+> things look like from our Earth centered
+> perspective and it's really hard to
+> learn physics and models in this
+> coordinate system whereas if we fix the
+> coordinate system and put the Sun at the
+> center it's a lot easier the data makes
+> more sense and it's easier to build a
+> model from data so often curating the
+> data means finding the right coordinate
+> system so sometimes we learn that
+> sometimes we kind of force that through
+> prior knowledge of the physics and it
+> makes a huge huge difference in the
+> learning process good uh okay stage
+> three designing an architecture we're
+> going to spend a ton of time here
+> probably you know five or 10 hours are
+> going to be dedicated just to all of the
+> different architectures on the market
+> different loss functions for those
+> architectures um again at a mile high we
+> might have things like Lagrangian neural
+> networks so if you know that your system
+> has this kind of Euler-Lagrange equation
+> uh framing if it's a mechanical system
+> like a double pendulum you might use a
+> specific architecture like a Lagrangian neural
+> network um so that your system conserves
+> energy important Point almost all of
+> these architectures also have a
+> corresponding set of custom loss
+> functions that are needed to train them
+> so stages three and four really do
+> sometimes kind of uh get a little bit
+> mixed okay so this is just one example
+> of a particular structure for a physical
+> system
+> um we also have um one notion of of kind
+> of what is physical is uh parsimony we
+> like our models to be as simple as
+> possible to describe the data but no
+> simpler that's the approach here in our
+> SINDy autoencoder where you use an
+> autoencoder to learn a good coordinate
+> system and then you have the sparsest
+> model in some class of models that
+> describes the data okay so we're going
+> to use that principle of low dimensional
+> and sparse to promote models that are as
+> simple as possible to describe the data
+> and no simpler this principle of
+> parsimony that's been with us for 2,000
+> years from you know Aristotle to
+> Einstein that's been one of the ways
+> we've we've kind of measured if a system
+> is physical is in its Simplicity and its
+> ability to describe measurement data
+> okay so this is another type of
+> architecture um and then you know this
+> is actually one of my favorite
+> architectures it's it's one of the
+> earlier ones in the modern
+> physics-informed machine learning era
+> um by Julia Ling and
+> collaborators they were using they were
+> using neural networks to try to predict
+> uh to try to build closure models for
+> for turbulent fluid flows and their
+> architecture is different than a
+> standard deep neural network
+> architecture they have this additional
+> auxiliary kind of tensor input layer
+> that allows their Network to be
+> invariant to Galilean Transformations
+> again we know that these fluid flows
+> should be the same you know under
+> certain kinds of rotations and
+> translations and so through a choice of
+> architecture they made it so that their
+> models have to have that kind of uh
+> invariance and they perform way way
+> better than if you don't include that
+> into the machine learning process these
+> are just three of like literally
+> hundreds of different types of custom
+> architectures that you can use to imbue
+> certain types of physics into your
+> system now um I was talking to my wife
+> Bing about this earlier and she pointed
+> out when you choose an
+> architecture to be to make your system
+> more physical to make your machine
+> learning model more physical that's
+> something called an inductive bias it's
+> kind of a bias that goes uh unspoken but
+> it highly steers or influences the
+> downstream machine learning process so
+> we're going to talk about inductive
+> biases uh and choices of architectures a
+> bit more later in this section
+> okay good uh moving on so you know item
+> four this is another really really rich
+> area of research is how do you build
+> loss functions that promote models that
+> are more physical in certain ways um we
+> already talked about those sparse models
+> parsimonious models that are as simple
+> as possible to describe the data that is
+> Quantified by a term in the loss
+> function the L1 norm or the L0 norm
+> quantifies sparsity and parsimony so
+> often times you know physics is promoted
+> through additional regularizing terms or
+> loss functions um in your training
+> process probably the most famous example
+> of this is the physics informed neural
+> network or
+> PINN um Illustrated here so this is just
+> a standard uh kind of you know deep feed
+> forward neural network used to map some
+> inputs like space and time to some
+> outputs like a fluid flow field and
+> normally what I would do is I would just
+> have some loss function that averages
+> you know how accurate is this model over
+> some training data that's kind of the
+> naive way of doing it what the PINN does
+> is that from this output data you can
+> actually compute terms in your partial
+> differential equation and you can add
+> your actual physics the partial
+> differential equation that the system
+> should be satisfying as another loss
+> term in your loss function so if you
+> know the physics if you know that your
+> system is Divergence free or satisfies
+> the Navier-Stokes equations or the
+> elastic beam equation whatever the
+> equation is if you know it you can add
+> in a regularizing loss term to your
+> machine learning process and this should
+> dramatically improve the learning
+> performance you can get away with way
+> less data and get much better model
+> performance if you include the physics
+> as a loss function so again we're going
+> to have a whole lecture just on this one
+> topic um because it's so important codes
+> and examples and case studies and
+> everything okay the the last step um and
+> this one I think is super important it's
+> not quite as common as stages three and
+> four for embedding physics but it's very
+> very important and we do a lot of work
+> in this um in my lab and in my
+> collaborators groups is using
+> optimization to enforce physics so if I
+> add physics as a term in the loss
+> function that is essentially adding the
+> suggestion that we want physics to be
+> satisfied but it's always fighting you
+> know we want low model
+> and we want the physics to be satisfied
+> and they sometimes are dueling
+> objectives so just adding physics as a
+> loss function only kind of promotes
+> physics but it doesn't enforce that
+> physics is satisfied but often with
+> optimization you can constrain um and
+> enforce that your physics is satisfied
+> at a much higher Precision you know at a
+> much um a much more stringent type of
+> enforcing of physics so um for example
+> we've done some work in in fluid flow
+> modeling with machine learning and we
+> know that there are certain symmetries
+> and properties like energy is conserved
+> in these incompressible flows and so
+> what I could do is I could build a loss
+> function that has my model error and
+> some constraints that are necessary for
+> this to conserve
+> energy or instead of just having this as
+> a loss function I could do a constrained
+> least squares so that these constraints
+> are satisfied always to numerical
+> Precision by Construction so again if
+> this doesn't make perfect sense right
+> now it's okay we're going to go into
+> depth in each of these topics and have
+> you know at least one video but probably
+> a couple of hours of material on each of
+> these topics but the idea is you can
+> promote if I know some constraints that
+> should be satisfied for my system to be
+> physical things like energy conservation
+> or a certain type of symmetry I can
+> either add that in as a term in my loss
+> function but it's not going to go to
+> zero through the optimization process if
+> it's just in my loss
+> function or I can do things like use
+> Lagrange multipliers or constrained least
+> squares so that I can change my
+> optimization problem so that these
+> constraints are exactly satisfied uh
+> every you know every time I optimize
+> this function so that's something we do
+> a lot um and there are really powerful
+> methods for enforcing physics this way
+> it's a little more involved it gives you
+> better physics imposition but it's
+> also you know more human effort to
+> actually build this into the optimization
+> function another example um so we're
+> going to find kind of a theme in this
+> lecture series is um we're going to talk
+> a lot about what is physics in the first
+> place what is this physics we're talking
+> about and often times physics manifests
+> itself through symmetries okay um I am
+> bilaterally symmetric meaning I have
+> mirror symmetry so if I flip myself in
+> the mirror essentially nothing changes
+> and lots of physical systems have other
+> types of symmetries we've talked about
+> translational symmetry rotational
+> symmetry um this is you know a Quantum
+> potential well with um a hexagonal
+> potential so it has kind of this um
+> interesting symmetry group of uh
+> rotations and
+> Reflections and um I I'll talk about
+> this later but there are methods like
+> physics-informed DMD where you actually
+> constrain your solution to live on a
+> certain manifold of solutions that
+> satisfy that symmetry so this is again
+> going to be a whole probably mini short
+> course all about symmetries and how to
+> bake symmetries into machine learning
+> and how to extract symmetries uh from
+> your data with machine learning huge
+> huge topic very very important and this
+> is going to change how we do machine
+> learning it is currently changing how we
+> do machine learning and it's only going
+> to um continue to improve our
+> capabilities in the next 5 10
+> years okay so those are the five stages
+> those are at least you know five clear
+> opportunities of how to bake physics
+> into your machine learning process now
+> I'm just going to give you a super fast
+> sneak peek of the rest of what you're
+> going to see um at least in the next few
+> hours um so you know I guess first what
+> I'm going to do is I'm just going to
+> show you kind of a pictorial example of
+> those five stages just so you can see it
+> in a different way and then I'm going to
+> give you a sneak peek uh of some of the
+> things we're going to do you know coming
+> up next so again stage one is setting up
+> the problem so I'm going to give I'm
+> going to walk through a specific example
+> now where the problem I want to solve is
+> model reduction of a fluid flow so
+> problem is I want to take a complicated
+> model of a fluid and I want a simpler
+> model of that fluid something that runs
+> faster on a smaller
+> computer so I need some data I need to
+> curate some uh measurement data in this
+> case maybe I have a simulation an
+> expensive simulation of a full
+> complexity
+> you know numerical simulation of a fluid
+> flow so my problem is trying to get a
+> simpler description I start with the
+> high-dimensional data stage three is
+> choosing an
+> architecture so in this case the
+> architecture I'm going to choose is this
+> kind of autoencoder architecture because
+> it has the properties I want it goes
+> from high dimension to low Dimension and
+> these low dimensional States Z are
+> chosen so that they are most likely to
+> be able to reconstruct the high
+> dimensional State again
+> and I'm also going to say that I want to
+> be able to learn a dynamical system some
+> differential equation on that low
+> dimensional latent State that's what I
+> mean by model reduction I want to take
+> my high dimensional system compress it
+> down to a low dimensional State and
+> learn some Dynamics some differential
+> equation in that low dimensional State
+> it'll be much much faster then to run
+> this simulation maybe in real time and
+> I'll get most of the physics that's
+> happening in this system okay so those
+> are the first three stages
+> stage four is crafting a loss function
+> so we need to build some kind of a
+> function that tells me if my model did a
+> good job now remember our architecture
+> is some function that Maps inputs to
+> outputs to do you know the problem to
+> solve the problem we want to solve and
+> it has parameters Theta these parameters
+> are tuning parameters that I can tweak
+> to make this function fit the data okay
+> in this case I have you know my three
+> neural networks each of them has a set
+> of parameters Theta 1 Theta 2 and theta
+> 3 and I can tweak those parameters to
+> make this function fit this data better
+> okay and that's Quantified by this loss
+> function what I mean by by a better fit
+> and finally the fifth stage is to
+> optimize over all of those thetas
+> literally to tweak the thetas to
+> minimize the loss function averaged over
+> the training data this is just a
+> restatement of what we've already said
+> but in a slightly different pictorial uh
+> fashion
+> okay okay good so again now we're going
+> to zoom out do a sneak preview of what
+> we're going to see next uh in this
+> series so applications and Engineering I
+> think this is a huge one we're going to
+> have lots of case studies and
+> applications looking into how this has
+> actually been used in various use cases
+> and what I'm going to show here is just
+> a small window of the many many
+> applications um that there are and we're
+> going to explore in this series so
+> things like again modeling fluid flows
+> and
+> turbulence um things like shape
+> optimization how do you actually design
+> a wing that has the right lift-over-drag
+> characteristics and that has the
+> right structural characteristics this is
+> a big multi-objective optimization
+> problem and remember training a machine
+> learning model is also a multi-objective
+> optimization problem so these kind of
+> make sense that they would go together
+> so things like modeling fluid flows
+> maybe designing you know the shape of an
+> aircraft or a car or a wind
+> turbine um designing a material a
+> composite or an alloy that you're going
+> to use in these engineering Downstream
+> applications again multiobjective
+> optimization very high-dimensional
+> design space for uh for
+> materials um digital twins so I might
+> have some physical aspect like a robot
+> arm or like an aircraft or a factory
+> floor and I might want some digital
+> representation some Digital model that I
+> can design and optimize over uh like
+> this digital twin here um and then kind
+> of just robotics in general Robotics and
+> autonomy is an area that is rapidly uh
+> advancing with new machine learning
+> technology and again this is just a
+> pretty small um you know subset of the
+> applications that there are and that
+> we're going to talk about other examples
+> include drug Discovery protein
+> design um you know and those are systems
+> that are governed by by physics there's
+> physics and how proteins fold and how
+> you know chemistry Works um so you know
+> those are also big big applications um
+> other areas things like climate science
+> understanding weather phenomena and
+> patterns and climate Trends again
+> related to turbulence but a very very
+> different context okay tons of
+> applications uh of
+> this um digital twin is really an
+> important concept here when we think
+> about building a machine learning model
+> that is trained from data and has this
+> kind of hybrid physics and data driven
+> flavor that's very much what a digital
+> twin is going for it's kind of a living
+> uh model of some complex system like an
+> aircraft or a factory floor or you know
+> a materials uh baking process like a
+> composite baking process where you want
+> this digital uh twin of your physical
+> asset to be as accurate as possible
+> you want your model to update
+> as you collect new data so your machine
+> learning models need to be able to
+> update you need to have uncertainty
+> estimates of how good that model is or
+> where it's the most uncertain so you can
+> either know if you can trust your model
+> and in you know the best case scenario
+> you know how to go collect more data to
+> improve your model all of those are
+> working towards this notion of a digital
+> twin which in principle should allow us
+> to design to do better engineering
+> design cheaper and faster and safer so
+> we're gonna have again a whole short
+> course on digital twin engineering where
+> you know we talk about how do you build
+> these digital twin models and how do you
+> use them for better design uh testing
+> evaluation things like that okay I
+> really like this diagram here from this
+> uh paper by uh Kapteyn et al. in uh
+> Nature Computational Science and it
+> kind of is a nice representation of this
+> kind of Duality between the physical and
+> the digital twin of that physical asset
+> okay so um again physics-informed machine
+> learning if you're going to use a
+> digital twin for something like an
+> aircraft it better be you know physical
+> it better have that physics baked into
+> that
+> process okay we're also going to talk
+> about the importance of Benchmark
+> systems so uh most of the progress in
+> classic machine learning image Sciences
+> natural language processing things like
+> that a lot of that progress is because
+> we had very good Benchmark problems and
+> data sets that you could use that the
+> community could use to test their
+> methods on so a researcher didn't have
+> to create a benchmark problem and the
+> Machine learning algorithm they could
+> focus their work on the machine learning
+> algorithms okay and so increasingly
+> we're finding that we need those kinds
+> of Benchmark problems for Dynamics and
+> control for physics and Engineering
+> problems so I might need you know these
+> static data sets of of our systems um
+> they're not static in the sense the
+> system is evolving in time but once I
+> collect that data the data doesn't
+> change it's a static data set of a
+> dynamical system it could be simulated
+> or
+> experimental um but really I want to be
+> you know developing these Benchmark
+> systems
+> for engineering purposes where we're
+> actually trying to manipulate the
+> behavior of the system we actually want
+> to control the fluid with our machine
+> learning model so I need to be able to
+> interact with that simulation and
+> actually close the loop with my machine
+> learning model so we're developing
+> Benchmark problems in that kind of
+> interactive uh control framework as well
+> and then ideally we would go you know to
+> actual living cyber physical systems
+> like an actual piece of laboratory
+> equipment that you can train your
+> machine learning model on in real time
+> that's much closer to the kind of
+> digital twin notion that we're going to
+> be getting towards again this is going
+> to be an entire short course on all of
+> the benchmarks that are out there what
+> are the characteristics that you need
+> for a physics-informed machine learning
+> Benchmark and so on so this is uh an
+> important topic we're going to come back
+> to but we need to be thinking about you
+> know how are we going to test our models
+> and what is the ground truth do we have
+> ground truth in some of these
+> cases um architectures we already talked
+> about the importance of that stage three
+> and really it's architectures and the
+> loss functions we use to train these
+> architectures but we're going to cover a
+> lot of the really really important ones
+> things like uh ResNets and U-Nets we're
+> going to cover uh Fourier neural operators
+> and general kind of operator methods uh
+> we'll talk about SINDy and you know kind
+> of Library regression and parsimonious
+> symbolic regression modeling we'll talk
+> about physics-informed neural networks
+> uh deep operator networks and a bunch
+> more this is just kind of a a quick
+> smattering of some of the architectures
+> we're going to look at but we're going
+> to spend a lot of our time doing whole
+> videos or even whole video series diving
+> into these really really important
+> topics how do you build these loss
+> functions how do you train these what
+> applications do they work on you know
+> what how do we actually code this up and
+> try it ourselves okay so we're going to
+> really dig into a lot of these
+> applications uh and you can actually
+> take these then and use these to build
+> your own um you know examples and and
+> try this out
+> yourself and again um we're going to
+> throughout this entire series revisit
+> this notion of like what is physics in
+> the first Place why do we want to bake
+> physics into our systems why do we need
+> to bake physics into our systems when we
+> try to learn physics from data what does
+> that even mean what are we learning and
+> time and time again we're going to find
+> that things like conservation laws
+> invariances and symmetries are some of
+> the ways that we end up uh encapsulating
+> physics so we're going to again see
+> symmetries a lot you know in uh drug
+> Discovery and protein folding there are
+> symmetries fluid flows and mechanical
+> systems uh material systems Quantum
+> systems symmetries are fundamental to
+> physics and so symmetries and
+> conservation laws and invariances are
+> going to keep coming up over and over
+> again also this notion of parsimony or
+> uh Simplicity of the model and in the
+> process of doing this we're also going
+> to not lose track of the history uh of
+> Science and the history of physics
+> because that's going to give us a lot of
+> important parallels that we can learn
+> from in this modern machine learning era
+> so the difference between astrology and
+> astronomy what changed the difference
+> between Alchemy and chemistry okay and
+> it really there is a lot of that uh
+> parallel happening today in machine
+> learning and so we can learn a lot from
+> the history of science uh when we're
+> doing this kind of new physics-informed
+> machine
+> learning okay uh so that's the that's
+> the topic that's the course um
+> physics-informed machine learning we're going
+> to have tons of short uh focused modules
+> on each of these
+> topics um I see this fitting into you
+> know a larger coursework um that we're
+> hoping you know to actually offer for
+> credit uh at University of Washington at
+> some point in the future so you know I
+> hope you enjoy this it's a passion of
+> mine it's one of the most important
+> fields of machine learning and this is
+> going to increasingly become uh you know
+> very powerful tool in Your Arsenal to
+> solve real world problems all right
+> thank you
+> 
+## Summary
+
+The video provides a comprehensive introduction to Physics-Informed Neural Networks (PINNs), a powerful approach that combines traditional neural networks with physics constraints encoded in the loss function. The key innovation of PINNs is using automatic differentiation to compute derivatives needed for physical laws, and adding these as soft constraints during training.
+
+> "it's based on a kind of neural network idea that also incorporates known physics in the form of a partial differential equation in the loss function"
+
+> "what these authors did in this kind of classic now PINN paper is they extended this naive neural network picture... by computing things like partial derivatives of these output variables in terms of the input spatial and temporal variables... you can use that same idea in a modern language like PyTorch or JAX to compute these partial derivatives of the outputs with respect to the inputs"
+
+> this paper um out of Michael Mahoney's uh kind of group and and collaboration on characterizing the actual failure modes in PINNs and I'm not going to expect you to read all of this you should download the paper and check it out because they have open source code and you can play with it yourself
+> ... a set of physics PDE problems where they can show under certain parameter regimes the method fails to train um and then essentially they looked at you know adding how what if you balance that that PDE loss term so you can crank that PDE loss term up or down that's a hyperparameter of how important ...
+>
+> 28:54
+>
+> ... increase that physics loss from from zero to you know some large value that's a very clever and kind of straightforward idea and then uh their second idea is posing the learning problem as a sequence to sequence learning task so two concrete ideas for how to actually uh fix these failure modes and all ...
+
+## Key points
+
+- PINNs combine neural networks with physics-based loss terms
+  - > "it takes a standard kind of neural network representation of the field variables you want to predict... and it computes the partial derivatives that are relevant for the physics in this problem... from those computed partial derivatives you can create this additional loss term that tells you if this known physics is being violated or not"
+
+- PINNs can work effectively with limited data by leveraging physics constraints
+  - > "this works really well for systems where we know something about the physics... even if I only have pretty limited data here I can still test if my network is physical... I can evaluate that on test points that are not actually in my training data"
+
+- Physics constraints are only enforced softly as part of the loss function
+  - > "because the physics is added as a loss function that's a strength because it's very intuitive and easy but it also only suggests that your physics is being satisfied... you're almost never going to actually have this purple term here be exactly zero"
+
+- PINNs excel at reconstruction from sparse measurements
+  - > "it's really really good for estimating things like whole flow fields with fairly sparse sensor measurements... taking limited sensor data and inferring between the lines... of what the velocity field should look like to be consistent with your measurement data and to be consistent with your governing equations"
+
+- PINNs may struggle with certain systems (discontinuities, chaotic flows)
+  - > "this is going to have a harder time on systems that are discontinuous with things like shock waves or chaotic convecting flows... it's not going to probably be amazing at those kinds of flows"
+
+- Training PINNs can be challenging and often requires careful tuning
+  - > "These don't always train that well sometimes they're stiff sometimes they overfit there's issues"
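+
+A minimal sketch of the composite objective these points describe, under assumed notation that is not in the lecture: $u_\theta$ is the network, $\mathcal{N}[\cdot]$ the PDE residual operator, $(x_i, t_i, u_i)$ the measured data, $(x_j, t_j)$ collocation points where only the physics is checked, and $\lambda$ a weight on the physics term.
+
+$$
+\mathcal{L}(\theta) = \frac{1}{N_d}\sum_{i=1}^{N_d}\bigl\|u_\theta(x_i,t_i)-u_i\bigr\|^2 + \lambda\,\frac{1}{N_r}\sum_{j=1}^{N_r}\bigl\|\mathcal{N}[u_\theta](x_j,t_j)\bigr\|^2
+$$
+
+As an illustrative choice, for a 1D heat equation one would take $\mathcal{N}[u] = \partial_t u - \alpha\,\partial_{xx} u$, with both derivatives obtained by automatic differentiation of the network output with respect to its inputs. Because the residual enters only as a penalty, minimizing $\mathcal{L}$ generally leaves it small but nonzero, which matches the soft-constraint behavior described in the quotes above.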
+
+## Technical terms
+- **[[Physics-Informed Neural Networks (PINNs)]]**: Neural networks that incorporate physical laws as constraints in the loss function
+  - > "it's based on a kind of neural network idea that also incorporates known physics in the form of a partial differential equation in the loss function"
+
+- **[[Automatic differentiation]]**: The computational technique used to calculate derivatives in neural networks
+  - > "you can compute these terms essentially using the same kind of auto differentiation automatic differentiation and back propagation that would you would normally use to train a neural network"
+
+- **[[Fractional PINNs]]**: An extension of PINNs that can handle fractional derivatives and integral terms
+  - > "one of the ones I think is really cool are fractional PINNs for partial differential equations that have things like fractional derivatives... often times you have these nasty integral differential equations with fractional derivatives in your physics"
+
+- **[[Delta PINNs]]**: A PINN variant that incorporates knowledge about the geometry of the problem domain
+  - > "if you have a geometry... where my PDE is living you can actually bake into the PINNs... the eigenfunctions of the Laplace-Beltrami operator on that geometry give you a very good natural coordinate system in which to represent solution functions of that PDE"
+
+## Predictions
+
+- Traditional simulation methods will increasingly be supplemented by PINNs for certain applications
+  - > "in the past we used to use a method called 4D VAR... it was this horrific physics-based constrained optimization... now in PINNs is kind of abstracted from the human, it uses modern machine learning architectures and optimizations and it typically does as well in a lot of cases"
+
+## Surprises
+
+- PINNs can effectively reconstruct fields from visual data, not just numerical measurements
+  - > "you can take very limited information and sometimes information that is not even like a velocity field like a smoke visualization or some kind of movie and you can infer what the actual velocity field that is kind of closest to satisfying the Navier-Stokes equations"
+
+- The physics loss in PINNs can actually make optimization more difficult in some cases
+  - > "they found that the loss function by increasing or decreasing that PDE-based soft constraint actually makes it more complex to optimize and harder to find a good solution in some cases"
+
+## Conclusion
+
+Physics-Informed Neural Networks represent an elegant approach at the intersection of machine learning and physics, allowing physics knowledge to guide neural network training through differential constraints. They excel at reconstructing fields from sparse data but require careful implementation. The method has seen widespread adoption and numerous extensions due to its intuitive nature and the ability to work with limited training data, though training challenges remain an active research area.
+
+# AI/ML+Physics Part 1: Choosing what to model [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/ARMk955pGbg/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=ARMk955pGbg&list=PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa&index=2&t=1s&pp=iAQB)
+## Summary
+
+This lecture focuses on the critical first stage of physics-informed machine learning: deciding what problem to actually model. The speaker emphasizes that this is perhaps the most foundational and important step, requiring the same rigor as traditional scientific and engineering design processes. The lecture covers various applications where machine learning can benefit physics problems, from discovering new physics to accelerating expensive simulations, while cautioning against using machine learning as a default solution when simpler methods would suffice.
+
+> "computers are useless they can only give you answers and it's kind of the exact same story with machine learning okay these data-driven models are useless they can only you know model your data you have to be the one to decide what is the right model what is the right Fidelity what would be a useful model and how are you going to use it Downstream that's a human endeavor"
+
+## Key Points
+
+- **Problem definition requires iterative refinement across all ML stages**
+  > "sometimes I don't know exactly what the problem is I have a vague notion like I Want to Build a Better race car or I want to build a better you know a better wing and that's not really a specific enough problem... very often we find ourselves going down uh this path even all the way down you've trained your model and you realize it doesn't do exactly what you wanted it to"
+
+- **Four main reasons to use ML in physics applications**
+  > "learning new physics there are many many systems where we have a pretty good understanding of how the system works... but there are tons of systems where we don't know the physics so if I think about um Neuroscience we don't actually have governing equations for the brain or an epidemiological system or even the climate system"
+
+- **Automatic differentiability provides unique advantages for engineering design**
+  > "machine learning models are often not always but often kind of inherently automatically differentiable... we might be able to use this automatic differentiation that we're already using to train things like neural networks we might be able to use that to do better faster cheaper uh design optimization"
+
+- **Chaotic systems require different modeling approaches**
+  > "because the system is chaotic if I have a small uncertainty in my initial Condition it's going to massively amplify uh in the future so getting a deterministic prediction of the behavior of the system forward in time might be too much to ask for maybe what I want is a probability of being at some location in a future time"
+
+## Technical Terms
+
+- **[[Reynolds Average Navier Stokes (RANS)]]**: A turbulence modeling approach where Reynolds stresses require closure models
+  > "there's a turbulence model called Reynolds-averaged Navier-Stokes... these Reynolds stresses in yellow we don't have really really good models of those terms we have to approximate them that's the closure problem"
+
+- **[[Digital Twin]]**: A digital representation of a physical asset comprising multiple model fidelities
+  > "the digital twin uh comprises a hierarchy of models at different fidelities some will be you know really crude simulations... some will be uh High Fidelity models that are really really expensive and we might have these machine learning models kind of in the middle uh stitching together these different fidelities"
+
+- **[[Super Resolution]]**: Technique for enhancing low-resolution data to high-resolution using statistical patterns
+  > "super resolution is an idea from image sciences where you have a low-res image uh and with kind of the statistical information you've collected over lots and lots of data you might be able to learn how to fill in that low-res image into a high-res image"
+
+## Predictions
+
+- **Benchmark systems will become critical for physics-ML**
+  > "we need to have uh benchmark systems kind of like ImageNet for you know image classification we need similar benchmarks for dynamical systems for engineering systems for control systems"
+
+- **Multi-fidelity surrogate models will enable better design optimization**
+  > "you would hope that your surrogate machine learning model can take all of that data and build a model that's a little bit of The Best of Both Worlds you know low cost lower cost and lower error"
+
+## Surprises
+
+- **Problem definition may be the most important stage despite seeming simple**
+  > "at first I actually thought this was going to be the easiest video to make um because you know how complicated could it be... but it turns out I've actually been spending the last couple of hours thinking about this and every time I think I have uh figured it out there's something new I want to tell you because this is actually maybe the most important out of all of these stages"
+
+- **Turbulence closure modeling hit a wall that ML could overcome**
+  > "turbulence closure modeling Reynolds-averaged Navier-Stokes modeling is something where you know we had a huge amount of progress up until you know the 1970s and 80s... at some point researchers hit a wall because these are really nasty you know functions to approximate... that's the perfect example of what machine learning is good at"
+
+- **The choice between learning differential equations vs. discrete time steppers is fundamental**
+  > "even that is a is a choice you have to make and there's a huge uh difference between these so the entire ResNet architecture lives over here and the neural ODE architecture lives over here they're solving related but different problems"
+
+## Conclusion
+
+The lecture establishes that problem formulation in physics-informed machine learning requires the same rigor as traditional scientific inquiry. The speaker emphasizes that machine learning should be viewed as a tool to augment physics understanding rather than replace it, with specific applications in discovering new physics, accelerating expensive simulations, and enabling better design optimization through automatic differentiability. The key insight is that proper problem definition often determines success more than the choice of architecture or optimization method.
+
+# AI/ML+Physics Part 2: Curating Training Data [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/g-S0m2zcKUg/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=g-S0m2zcKUg&list=PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa&index=3&pp=iAQB)
+## Summary
+
+This video discusses the second stage of physics-informed machine learning: data curation. The speaker covers various aspects of incorporating physics knowledge into the data collection and preparation process, emphasizing that this approach can significantly reduce data requirements compared to traditional machine learning methods.
+
+The speaker argues that while companies like Google and Microsoft can rely on massive datasets to implicitly learn physics, most engineering applications need explicit physics integration to work with limited, expensive data:
+
+> "if we embed physics into our learning process we can often get away with a lot less data... if I can incorporate physics to constrain that neural network to reduce the search space in that neural network I can often get away with much less data"
+
+## Key Points
+
+- **Data augmentation through physics symmetries**
+  - > "if I think that this fluid system that I'm trying to model is invariant to rotations like if I rotate things nothing should change then in principle I can augment my data to include rotated copies and that enriches my data"
+
+- **Critical importance of coordinate systems**
+  - > "the coordinate system in which you measure your data really really really really matters and that is an opportunity for your human expertise for your physical knowledge and intuition to play a huge role"
+  - > "getting your measurements right and in the right coordinate system is a huge Head Start that you can give your machine learning model"
+
+- **Simulation vs experimental data tradeoffs**
+  - > "often times in simulations I have a lot more spatial information I have high resolution spatial fields that are very hard to get in experiments... whereas if I had a wind tunnel experiment I might not get the full fluid flow field but I can run this thing for hours and hours and hours"
+  - > "experiments have the real nitty-gritty details that we don't understand... the gold standard is still building experiments because that's where you're going to see if your assumptions broke"
+
+- **Design optimization requires extrapolation beyond training data**
+  - > "for design optimization if I want to design something better than my training data then I need my models my machine learning models to generalize beyond my training data"
+  - > "the way you get your machine learning model to generalize beyond the training data... is by baking in physics into those models"
+
+- **Data bias and rare events challenges**
+  - > "if I have training data of ocean waves you know 99.999% of them are not rogue waves... I can build an extremely accurate machine learning model just by focusing on this boring data here and just ignoring the existence of these rare events"
+
+- **Hidden variables in physical systems**
+  - > "most of the systems we interact with we don't have access to all of the variables that are important for that system and we have to build models on partial information"
+
+## Technical Terms
+
+- **[[Physics-informed machine learning]]**: Integration of physical knowledge and constraints into machine learning models to improve performance with limited data
+  - > "this introductory series on physics informed machine learning where we're essentially looking at the different opportunities and subtleties of incorporating physics into the machine learning process"
+
+- **[[Data augmentation]]**: Artificially expanding datasets by applying known physical transformations
+  - > "I can augment my data with copies of the data that I pass through those different Transformations"
+
+- **[[Digital twin]]**: Machine learning models of physical assets used for optimization and design
+  - > "digital models of physical assets so we can do improved optimization and design and control"
+
+- **[[Multi-fidelity data]]**: Combining data sources of different accuracy and computational cost
+  - > "lots of kind of in between areas where you might have multi Fidelity data sources or data from simulations and experiments"
+
+## Predictions
+
+- **Future of complex system design**
+  - > "this is increasingly going to be how we design really complex systems in the future... it's going to give us the best of both worlds lower cost and lower error"
+
+- **Active learning integration with digital twins**
+  - > "if it has a high uncertainty but the design is promising it can actually go back and collect more data and it can collect data at different fidelities based on its uncertainty"
+
+## Surprises
+
+- **Historical physics discoveries involved strategic data selection**
+  - > "presumably Galileo picked two densities that were dense enough that this could be kind of neglected and that was a very strategic human guided decision not to drop a beach ball but to drop denser objects"
+  - > "if you actually drop balls of different densities and you start to see these fluid effects you'll actually find that the more classic and incorrect model of Aristotle... actually fits the data better"
+
+- **Small signals can contain fundamental physics**
+  - > "Einstein's relativity is only required for capturing that tiny last little bit so if I'm training the machine learning model and it's just trying to minimize error or loss it's almost certainly going to pick Newton's second law"
+
+- **Single measurements can reconstruct full system states**
+  - > "from a single measurement a single scalar measurement of the x coordinate of this Lorenz system we can start learning how to embed that through some neural network coordinate transformation in a way that we get a simple explanatory differential equation out"
+
+## Conclusion
+
+The speaker emphasizes that data curation in physics-informed ML is highly problem-dependent and requires careful consideration of coordinate systems, data sources, biases, and the ultimate application goals. The key insight is that incorporating physics knowledge at the data stage can dramatically reduce data requirements and improve model generalization, which is crucial for engineering applications where data is expensive and design objectives often lie outside the training distribution.
+
+# AI/ML+Physics Part 3: Designing an Architecture [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/fiX8c-4K0-Q/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=fiX8c-4K0-Q&list=PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa&index=4&pp=iAQB)
+## Summary
+
+This lecture introduces Stage 3 of physics-informed machine learning: designing architectures. The speaker defines physics in terms of interpretability, generalizability, parsimony/simplicity, and symmetries/invariances/conservation laws, then explores how neural network architectures can embed these physical principles. The discussion covers various architectures from ResNets to Physics-Informed Neural Networks (PINNs), emphasizing how architectural choices constrain the space of possible functions and can enforce physical properties like Galilean invariance or energy conservation.
+
+> "architectures Define a space of functions we're searching over and we find the function we want by tuning these free parameters Theta"
+
+## Key Points
+
+- **Physics definition for ML context**: Physics is characterized by four key properties
+  > "interpretable and generalizable interpretable in the sense that they're usually very very simple... and it's very generalizable because the same physics that describes this apple also describes you know the physics of launching a rocket uh from the Earth to the Moon"
+
+- **Architecture as function space constraint**: ML architectures fundamentally constrain the space of possible functions
+  > "what most machine learning architectures do is they constrain the space of possible functions that could describe this input output relationship through a choice of architecture"
+
+- **Parsimony principle**: Simplicity as a core physics principle dating back millennia
+  > "this principle of Simplicity or parsimony has been the gold standard in physics for 2,000 years from Aristotle to Einstein the models that are more beautiful more parsimonious as simple as possible and no simpler typically encapsulate the core bits of physics"
+
+- **Architecture-loss function coupling**: Architectures and loss functions are inherently intertwined
+  > "architectures usually have loss functions that are good uh to train those architectures... loss functions often rely on an architecture and architectures often have custom loss functions you use to train those models"
+
+- **Galilean invariance through architecture**: Julia Ling's work shows how architectural choices can enforce physical symmetries
+  > "through a choice of this architecture with this custom tensor layer enforces Galilean invariance by construction all of the models represented in this function space are Galilean invariant"
+
+## Technical Terms
+
+- **[[Galilean Invariance]]**: A fundamental physics principle where the laws of physics remain unchanged in any inertial reference frame
+  > "the physics doesn't change in any inertial reference frame so if I have a box of turbulence here or I have a box of turbulence moving at a constant velocity the physics the the closure model terms the physics shouldn't change in any inertial frame"
+
+- **[[Invariance vs Equivariance]]**: Two distinct mathematical concepts for handling symmetries in ML
+  > "invariance means that we know that the notion of a dog shouldn't matter if that dog is translated in the image or rotated or scaled... equivariance means that if I take my data and I transform it through some symmetry like a rotation some symmetry operation and then I run both of those through my neural network then the output of my neural network is also run through uh that rotation or translation"
+
+- **[[SINDY (Sparse Identification of Nonlinear Dynamics)]]**: A non-neural network architecture for discovering differential equations
+  > "SINDy the sparse identification of nonlinear dynamics importantly this is not a neural network this is a generalized linear regression to learn a differential equation from data"
+
+- **[[Physics-Informed Neural Networks (PINNs)]]**: Networks that incorporate physical laws through automatic differentiation
+  > "you can take a normal feed forward network that you would use to kind of predict those quantities and then because of the automatic differentiability uh of these neural network uh environments like PyTorch and TensorFlow and JAX you can often compute these partial derivatives of these quantities without having to like hard code it"
+
+## Predictions
+
+- **Extensive lecture series ahead**: The speaker plans substantial content on each architecture type
+  > "we're going to have hours and hours and hours of you know material and lectures on various architectures... I'm pretty sure I have like 5 hours of material on SINDy alone"
+
+- **Neuroscience-ML convergence**: Increasing integration between neuroscience and machine learning architectures
+  > "these two fields are definitely evolving and growing together uh both neuroscience and machine learning"
+
+## Surprises
+
+- **Non-neural architectures included**: SINDy is presented as an architecture despite being generalized linear regression
+  > "SINDy the sparse identification of nonlinear dynamics importantly this is not a neural network this is a generalized linear regression to learn a differential equation from data okay and this is an architecture"
+
+- **Wikipedia physics definition inadequacy**: The speaker found the standard physics definition insufficient for ML contexts
+  > "I was at um a NPS Workshop about a month ago and I was giving a talk about you know machine learning for scientific discovery and I decided I should probably you know Wikipedia what is the definition of physics before I say that we're doing physics informed machine learning... that's fine and good but I don't like that as a working definition"
+
+- **Historical pattern of simplification**: Scientific progress consistently leads to simpler, not more complex descriptions
+  > "every time we've made this huge kind of Leap Forward in our understanding of physics things have actually gotten simpler the descriptions have gotten simpler and more Universal"
+
+## Conclusion
+
+The lecture establishes architectures as a critical stage in physics-informed machine learning, where the choice of network structure inherently constrains the space of learnable functions. By carefully designing architectures that embed physical principles like symmetries, conservation laws, and parsimony, researchers can create models that require less training data and generalize better. The upcoming series promises deep dives into specific architectures, with particular emphasis on how symmetries and invariances can be built into machine learning models by construction rather than learned from data augmentation.
+
+# AI/ML+Physics Part 4: Crafting a Loss Function [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/3SNkQ8jhKXc/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=3SNkQ8jhKXc&list=PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa&index=5&pp=iAQB)
+## Summary
+
+This video covers physics-informed machine learning specifically focusing on how to craft loss functions that embed physical knowledge into machine learning models. The presenter explains this as the fourth stage in a five-step physics-informed ML pipeline, emphasizing that custom loss functions are one of the most accessible ways to make ML models more physical, improving their generalization, learning efficiency, and sample efficiency.
+
+> "this is one of the areas that's kind of most commonly used and it's one of the easiest ways of baking physics into a machine learning model which should improve its generalization capabilities uh its learning rates and efficiency sample efficiency lots of benefits and it's a relatively simple straightforward thing you can do to your machine learning model to make it more physical"
+
+## Key Points
+
+- **Physics-Informed Neural Networks (PINNs) add physics constraints through dual loss functions**
+  - > "what PINNs physics-informed neural networks does that's very very clever is they add a second loss function so because of the automatic differentiability uh of these modern machine learning uh environments PyTorch JAX uh TensorFlow so on you can take these quantities u v w p and you can compute their partial derivatives with respect to space and time"
+
+- **PINNs balance data fitting with physics equation satisfaction**
+  - > "PINNs adds a loss function it adds an extra loss function here which essentially says how much is the governing physical equation the partial differential equation that governs the physics how accurate is it how much is it violated"
+
+- **Architecture and loss function choices are deeply interconnected**
+  - > "this is a point I'm going to make over and over again architectures and loss functions really go hand in hand so this is an architecture you you are learning a Lagrangian you know with these inputs that's an architecture but the thing that allows you to train this architecture and have it make sense and be Lagrangian is that you have this loss function"
+
+- **L1 and L2 norms promote different types of physical behavior**
+  - > "we can promote models that are more low dimensional with the two Norm uh and we'll show you how this goes into the loss function in a minute and we can promote models that are simpler have less terms describing them or are more sparse using the one Norm"
+
+- **Physical models should follow the principle of parsimony**
+  - > "your machine learning model should be made as simple as possible to describe the data and no simpler this has been the gold standard of what is physics uh for 2,000 years from Aristotle to Einstein"
+
+## Technical Terms
+
+- **[[Physics-Informed Neural Networks (PINNs)]]**: Neural networks that incorporate physical equations as additional loss terms
+  - > "physics informed neural network or a PINN uh developed by uh Raissi Perdikaris and Karniadakis"
+
+- **[[SINDy (Sparse Identification of Nonlinear Dynamics)]]**: Method for discovering governing equations from data using sparse regression
+  - > "I might also want to learn the dynamics in that latent space I might want to learn how those coordinates evolve in time in which case I might use something like the sparse identification of nonlinear dynamics to find the fewest terms in a differential equation"
+
+- **[[Lagrangian Neural Networks]]**: Networks that learn Lagrangian functions and enforce Euler-Lagrange equations
+  - > "if you know that your system conserves energy like it's a mechanical system like your double pendulum um we know that it does that by either having some Lagrangian structure you know it satisfies the Euler-Lagrange equations"
+
+- **[[L1 and L2 Norms]]**: Mathematical measures used to promote sparsity (L1) and smoothness/low-dimensionality (L2)
+  - > "the L2 norm measures distance in a typical Euclidean way it's just like the distance between two points as the crow flies the L1 norm is a little bit of a different uh measure of distance sometimes we call it the the Manhattan norm or the taxi cab norm"
+
+## Predictions
+
+- **PINNs will remain popular due to their simplicity**
+  - > "it's probably one of the most popular uh physics and machine learning algorithms out there cuz all you have to do is add this loss function uh and your models become more physical"
+
+## Surprises
+
+- **PINNs never exactly satisfy physics constraints**
+  - > "the downside is that by adding this physics as a term in the loss function is that you're never really going to exactly satisfy that this loss is zero so an actual physical system an actual fluid flow this purple loss should be exactly zero"
+
+- **Complex loss functions require extensive trial and error**
+  - > "this was not an easy loss function to to come up with this took like months of trial and error figuring out how to actually craft the loss function to quantify what we meant by physical in this in this context"
+
+- **Standard least squares regression produces overly complex physics models**
+  - > "I'm going to get a physics model x dot that has 81 terms in the differential equation Y Dot's going to have 81 terms Z dots going to have 81 terms you never open up a physics book and see you know a model that has 81 terms in the differential equation"
+
+## Conclusion
+
+The video demonstrates that crafting appropriate loss functions is crucial for embedding physical knowledge into machine learning models. While methods like PINNs offer accessible entry points, creating truly effective physics-informed loss functions often requires sophisticated understanding of both the underlying physics and optimization principles. The presenter emphasizes that the next stage - optimization - will provide even more rigorous ways of enforcing physical constraints, moving from promoting physics to actually enforcing it.
+
+# AI/ML+Physics Part 5: Employing an Optimization Algorithm [Physics Informed Machine Learning]
+
+![Thumbnail](https://img.youtube.com/vi/T4iJ10TAIMg/maxresdefault.jpg)
+
+👤 [Steve Brunton](https://www.youtube.com/channel/UCm5mt-A4w61lknZ9lCsZtBw)  🔗 [Watch video](https://www.youtube.com/watch?v=T4iJ10TAIMg&list=PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa&index=6&pp=iAQB)
+## Summary
+
+
+This lecture covers the fifth stage of physics-informed machine learning: optimization algorithms for training models. The speaker emphasizes that physics can be embedded directly into optimization procedures through constrained optimization, which enforces physics exactly rather than just promoting it through loss functions.
+
+> "if we only added this as a term in the loss function these two terms would be battling each other and you might get a model that's not exactly energy conserving and and uh doesn't have great model error because they're going to be kind of fighting each other but when you do this constrained optimization procedure when you build it directly into the optimization procedure you are minimizing the error and exactly satisfying those constraints"
+
+## Key Points
+
+- **Constrained vs. Penalized Optimization**: Direct constraints in optimization are superior to adding physics terms to loss functions
+  > "with a loss function you're not exactly satisfying your constraints with constrained optimization you are exactly satisfying your constraints"
+
+- **Energy Conservation in Fluid Dynamics**: JC Loiseau showed that incompressible fluid flows have 10 constraint equations that must be satisfied for energy conservation
+  > "because of the incompressibility of this fluid flow from first principles we can derive a set of constraint equations that have to be true for any model that we get of this incompressible fluid flow these 10 constraint equations should be satisfied for energy conservation to hold"
+
+- **Physics-Informed DMD**: Peter Baddoo developed methods to constrain optimization to search over matrices with specific symmetries (Hermitian, self-adjoint, etc.)
+  > "what Peter Baddoo's physics informed DMD does is it essentially changes the space of matrices that our algorithm is searching over and so this is a constrained optimization where we are essentially constraining our search to a particular manifold of matrices"
+
+- **Sparsity and Low-Dimensionality**: Physics models should be "as simple as possible to describe the data and no simpler," requiring custom optimization algorithms
+  > "physics models models that capture the essential physics of a system tend to be as simple as possible to describe the data and no simpler"
+
+## Technical Terms
+
+- **[[SINDy]]**: Sparse Identification of Nonlinear Dynamics - a procedure for discovering differential equations from data
+  > "this SINDy sparse model identification procedure where we try to find the fewest columns of theta that add up to equal x dot y dot and z dot"
+
+- **[[KKT Constrained Least Squares]]**: Karush-Kuhn-Tucker optimization method for exactly satisfying equality constraints
+  > "through this KKT constrained least squares he's able to exactly satisfy the constraints and minimize this model error"
+
+- **[[Procrustes Problem]]**: Mathematical optimization for restricting solutions to matrix manifolds
+  > "the mathematical optimization problem that allows us to to restrict to these matrix manifolds is called a Procrustes problem"
+
+- **[[SR3 Algorithm]]**: A new sparse optimization algorithm that finds solutions more efficiently than traditional methods
+  > "using this new kind of SR3 algorithm you almost immediately go to the optimal solution and it's better conditioned"
+
+## Predictions
+
+- Custom optimization algorithms will become increasingly important for physics-informed ML
+  > "sometimes you'll actually need custom algorithms to promote um... sometimes you're going to need custom algorithms to promote a certain type of constraint or to work with a certain type of loss function"
+
+## Surprises
+
+- Many constrained optimization problems have exact closed-form solutions, not just approximate ones
+  > "often times submanifold constraints often also have Exact Solutions so it's not out of the question that you could solve both of these you know uh by designing an optimization algorithm to keep you on this submanifold or Subspace"
+
+- Symbolic regression uses fundamentally different optimization (evolutionary algorithms) compared to typical ML
+  > "this is a whole different set of optimization algorithms that is you know designed to guide this architecture to find the right model"
+
+## Conclusion
+
+The speaker argues that embedding physics directly into optimization algorithms through constrained optimization is the "gold standard" for physics-informed machine learning, though it requires more human effort than simply adding physics terms to loss functions. This approach guarantees exact satisfaction of physical constraints rather than just promoting them.
\ No newline at end of file
diff --git a/pinn/docs/evidence/krishnapriyan2021_failure_modes.md b/pinn/docs/evidence/krishnapriyan2021_failure_modes.md
new file mode 100644
index 0000000..5226fc4
--- /dev/null
+++ b/pinn/docs/evidence/krishnapriyan2021_failure_modes.md
@@ -0,0 +1,1296 @@
+Title: 2109.01050v2.pdf
+
+URL Source: https://arxiv.org/pdf/2109.01050
+
+Published Time: Mon, 23 Jan 2023 11:16:56 GMT
+
+Number of Pages: 20
+
+Markdown Content:
+# Characterizing possible failure modes in physics-informed neural networks 
+
+Aditi S. Krishnapriyan ∗∗,1,2, Amir Gholami ∗,2, Shandian Zhe 3, Robert M. Kirby 3, Michael W. Mahoney 2,4
+
+1 Lawrence Berkeley National Laboratory, 2 University of California, Berkeley,
+
+3 University of Utah, 4 International Computer Science Institute
+
+{aditik1, amirgh, mahoneymw}@berkeley.edu, {zhe, kirby}@cs.utah.edu
+
+Abstract 
+
+Recent work in scientific machine learning has developed so-called physics-informed neural network (PINN) models. The typical approach is to incorporate physical domain knowledge as soft constraints on an empirical loss function and use existing machine learning methodologies to train the model. We demonstrate that, while existing PINN methodologies can learn good models for relatively trivial problems, they can easily fail to learn relevant physical phenomena for even slightly more complex problems. In particular, we analyze several distinct situations of widespread physical interest, including learning differential equations with convection, reaction, and diffusion operators. We provide evidence that the soft regularization in PINNs, which involves PDE-based differential operators, can introduce a number of subtle problems, including making the problem more ill-conditioned. Importantly, we show that these possible failure modes are not due to the lack of expressivity in the NN architecture, but that the PINN’s setup makes the loss landscape very hard to optimize. We then describe two promising solutions to address these failure modes. The first approach is to use curriculum regularization, where the PINN’s loss term starts from a simple PDE regularization, and becomes progressively more complex as the NN gets trained. The second approach is to pose the problem as a sequence-to-sequence learning task, rather than learning to predict the entire space-time at once. Extensive testing shows that we can achieve up to 1-2 orders of magnitude lower error with these methods as compared to regular PINN training. 
+
+1 Introduction 
+
+Partial differential equations (PDEs) are commonly used to describe different phenomena in science and engineering. These PDEs are often derived by starting from governing first principles (e.g., conservation of mass or energy). It is typically not possible to find analytical solutions to these PDEs for many real-world settings. Thus, many different numerical methods (e.g., the finite element method [44], pseudo-spectral methods [9], etc.) have been introduced to approximate their solutions/behavior. However, these PDEs can be quite complex for several settings (e.g., turbulence simulations), and numerical integration techniques, which typically update and improve a candidate solution iteratively until convergence, are often quite computationally expensive. Motivated by this—as well as the increasing quantities of data available in many scientific and engineering applications—there has been recent interest in developing machine learning (ML) approaches to find the solution of the underlying PDEs (and/or work in tandem with numerical solutions). As a result, the area of Scientific Machine Learning (SciML)—which aims to couple traditional scientific mechanistic modeling (typically, differential equations) with data-driven ML methodologies (most recently, neural network training)—has emerged. In this vein, there have been a number of ML approaches to incorporate scientific knowledge into such problems while keeping the automatic, data-driven estimates of the solution [2, 17, 33, 39].
+
+A recent line of work involves Physics-Informed Neural Network (PINN) models, which aim to incorporate physical domain knowledge as soft constraints on an empirical loss function, that is then optimized using existing ML training methodologies. To some degree, PINNs are an example of “grafting together” domain-driven models and data-driven methodologies. However, there are important subtleties with this, and we identify several possible failure modes with a naive approach. We then illustrate possible directions for addressing these failure modes.
+
+∗ Equal contribution.
+
+35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia.
+
+arXiv:2109.01050v2 [cs.LG] 11 Nov 2021
+
+Background and problem overview. Many of the problems with a PDE constraint fit the following abstraction: 
+
+$$\mathcal{F}(u(x, t)) = 0, \quad x \in \Omega \subset \mathbb{R}^d, \quad t \in [0, T], \tag{1}$$
+
+where $\mathcal{F}$ is a differential operator representing the PDE, $u(x, t)$ is the state variable (i.e., parameter of interest), $x$/$t$ denote space/time, $T$ is the time horizon, and $\Omega$ is the spatial domain. Since $\mathcal{F}$ is a differential operator, in general one must specify appropriate boundary and/or initial conditions to ensure the existence/uniqueness of a solution to Eq. 1. In the context of PDEs, $\mathcal{F}$ can be taxonomized into a parabolic, hyperbolic, or elliptic differential operator [23]. Quintessential examples of $\mathcal{F}$ include: the convection equation (a hyperbolic PDE), where $u(x, t)$ could model fluid movement, e.g., air or some liquid, over space and time; the diffusion equation (a parabolic PDE), where $u(x, t)$ could model the temperature distribution over space and time; and the Laplace equation (an elliptic PDE), where $u(x)$ could model a steady-state diffusion equation, in the limit as $t \to \infty$.
+
+One possible data-driven approach is to incorporate domain information by applying Eq. 1 as a “hard constraint” when training a NN on the data. This can be formulated as the following constrained optimization problem,
+
+$$\min_{\theta} \; \mathcal{L}(u) \quad \text{s.t.} \quad \mathcal{F}(u) = 0, \tag{2}$$
+
+where $\mathcal{L}(u)$ is the data-fit term (including initial/boundary conditions), and where $\mathcal{F}$ is a constraint on the residual of the PDE system under consideration (i.e., the “physics” knowledge in the equation itself). As mentioned before, for many practical use cases, it is not possible to derive closed form solutions for these problems, and it is often quite difficult to solve problems of the form of Eq. 2, with $\mathcal{F}(u)$ as a hard constraint. Another (related but different) data-driven approach is to impose the constraint as a “soft constraint” on the outputs of the NN model,
+
+$$\min_{\theta} \; \mathcal{L}(u) + \lambda_{\mathcal{F}} \mathcal{F}(u), \tag{3}$$
+
+$$\mathcal{L}(u) = \mathcal{L}_{u_0} + \mathcal{L}_{u_b}. \tag{4}$$
+
+Here, $\mathcal{L}_{u_0}$ and $\mathcal{L}_{u_b}$ measure the misfit of the NN prediction and the initial/boundary conditions (which are pre-specified/given as input to the problem), and $\theta$ denotes the NN parameters (the NN takes $(x, t)$, and possibly other quantities, as inputs and then outputs $u(x, t)$). Furthermore, $\lambda_{\mathcal{F}}$ is a regularization parameter that controls the emphasis on the PDE based residual (which we ideally want to be zero). The goal is then to use ML methodologies (stochastic optimization, etc.) to train this NN model to minimize the loss in Eq. 3. In particular, the NN is trained to minimize this modified loss function, where the modification is to penalize the violations of $\mathcal{F}(u)$ for some $\lambda_{\mathcal{F}} \geq 0$.
+
+However, even with a large training dataset, this approach does not guarantee that the NN will obey the conservation/governing equations in the constraint Eq. 1. In many SciML problems, these sorts of constraints on the system matter, as they correspond to physical mechanisms of the system. For example, if the conservation of energy equation is only approximately satisfied, then the system being simulated may behave qualitatively differently or even result in unrealistic solutions.
+
+We should also note that this approach of incorporating physics-based regularization, where the regularization constraint, $\mathcal{L}_{\mathcal{F}}$, corresponds to a differential operator, is very different than incorporating much simpler norm-based regularization (such as L1 or L2 regularization), as is common in ML more generally. Here, the regularization operator, $\mathcal{L}_{\mathcal{F}}$, is non-trivially structured: it involves a differential operator that could actually be ill-conditioned, and it does not correspond to a nice convex set (as does a norm ball). Moreover, $\mathcal{L}_{\mathcal{F}}$ corresponds to actual physical quantities, and there is often an important distinction between satisfying the constraint exactly versus satisfying the constraint approximately (the soft constraint approach doing only the latter).
+
+Main contributions. The contributions of this paper are as follows:
+
+- We analyze PINN models on simple, yet physically relevant, problems of convection, reaction, and reaction-diffusion. We find that the vanilla/regular PINN approach only works for very easy parameter regimes (i.e., small PDE coefficients), but that it fails to learn relevant physics in even moderately more challenging physical regimes, even for problems that have simple closed-form analytical solutions. For many cases, the vanilla PINN approach achieves almost 100% error, as compared to the ground truth solution, even after extensive hyperparameter tuning. (See §3 for details.)
+- We analyze the loss landscape of trained PINN models and find that adding/increasing the PDE-based soft constraint regularization ($\mathcal{L}_{\mathcal{F}}$ in Eq. 3) makes it more complex and harder to optimize, especially for cases with non-trivial coefficients. We also study how the loss landscape changes as the regularization parameter ($\lambda_{\mathcal{F}}$) is changed. We find that reducing the regularization parameter can help alleviate the complexity of the loss landscape, but this in turn leads to poor solutions with high errors that do not satisfy the PDE/constraint. (See §4 for details.)
+- We demonstrate that the NN architecture has the capacity/expressivity to find a good solution, thereby showing that these problems are not due to the limited capacity of the NN architecture. Instead, we argue that the failure is due to optimization difficulties associated with the PINN’s soft PDE constraint. (See §5 for details.)
+- We propose two paths forward to address these failure modes through (i) curriculum regularization and (ii) posing the learning problem as a sequence-to-sequence learning task. First, in curriculum regularization, we start by imposing the PDE constraint ($\mathcal{L}_{\mathcal{F}}$) with small coefficients, which are progressively increased to the target problem’s settings as the model gets trained. This gives the NN an opportunity to first train with easier constraints, before it is exposed to the target constraint which could be hard to optimize from the beginning. Second, we show that changing the learning problem to a sequence-to-sequence learning problem can reduce the PINN error, again without any change to the NN architecture. In this setup, the NN is trained on a time segment, instead of the full space-time, which could be more difficult to learn. The task is then to predict the solution and reduce the loss only over smaller time segments. We extensively test both approaches and show that they can reduce the error by up to 1-2 orders of magnitude as compared to regular PINN training, and in many cases can better capture “sharp” features in the solution. (See §5 for details.)
+- We have open sourced our framework [26], which is built on top of PyTorch, both to help with reproducibility and also to enable other researchers to extend the results.
+
+2 Related work 
+
+There is a large body of related work, and here we briefly discuss the most related lines of work. 
+
+Machine learning and PDEs. ML approaches for PDE problems have been increasing rapidly in recent years [13, 19]. A number of tools and methodologies now exist to solve scientific problems by combining ML and domain insights [14, 20, 27, 28, 38]. As mentioned earlier, a popular approach to combine ML and physical knowledge is to include aspects of the PDE term as part of the optimization process via regularization. A notable aspect of such an approach is that the NN can be trained only on data that comes from the governing equation(s) itself (though additional data can be included as well, if available), i.e., with a relatively small amount of data. This has garnered interest and shown successful results in a wide variety of science and engineering problems and applications [3, 11, 16, 29–31, 43]. However, there have also been issues observed with this formulation. For example, it did not work well for stiff ordinary differential equations (ODEs) describing chemical kinetics [15], for certain heterogeneous media problems [7], or for certain fluid flow problems [10]. Furthermore, PINN models have been analyzed in the context of neural tangent kernels (i.e., towards the infinite width limit) to study their convergence [36, 37]. This work found some cases where the model failed (such as when the target function exhibits “high frequency features”) and showed some preliminary solutions via the lens of the neural tangent kernel. It has been argued that some of these problems may be due to an imbalance in back-propagated gradients in the loss function during training, and a learning-rate annealing scheme has been proposed to mitigate this [35].
+
+Physical priors and constraints in NNs. Imposing physical priors and constraints on NN systems is common in SciML problems, as a way to try to enforce a property of interest. This idea has been introduced in different forms in the past (for instance [5, 18, 25, 28, 32]). Some approaches have focused on embedding specialized physical constraints into NNs, such as conservation of energy or momentum [4, 12] or multiscale features [34]. While methods focusing on constraining the output of the NN are more common, it is difficult to enforce such constraints exactly in ML settings. Previous work has tried to impose hard constraints in ML (both within the context of SciML and otherwise) [6, 21, 22, 24, 40], although this can be computationally expensive, and does not guarantee better results or convergence.
+
+3 Possible failure modes for physics-informed neural networks 
+
+In this section, we highlight several examples where the PINN formulation defined in Eq. 3 does not predict the solution well. We first demonstrate this with two different types of simple, canonical PDE/ODE systems which have simple analytical solutions: convection (§3.1), and reaction (§A). We then also include a diffusion component by looking at the reaction-diffusion problem (§3.2). Note that the convection problem has a linear PDE constraint, and reaction/reaction-diffusion problems both have non-linear PDE terms.² We show that PINNs can only learn simple problems with very small parameter values (e.g., small convection or reaction coefficients). We demonstrate that these models fail to learn the relevant physical phenomena for non-trivial cases (e.g., relatively larger coefficients). As we will see, while adding the physical constraint as a soft regularization may be easier to deploy and optimize with existing unconstrained optimization methods, this approach does come with trade-offs, including that in many cases the optimization problem becomes much more difficult to solve.
+
+Experiment setup. We study both linear and non-linear PDEs/ODEs, and we vary the convection, reaction, and diffusion coefficients for each problem (hereafter, we refer to these as PDE coefficients). For each problem, we aim to minimize the loss function in Eq. 3. We use a 4-layer fully-connected NN with 50 neurons per layer, a hyperbolic tangent activation function, and randomly sample collocation points $(x, t)$ on the domain. Furthermore, all the systems that we consider have periodic boundary conditions. We enforce this through an extra term in the loss function that takes the difference between the predicted NN solution at each boundary. We train this network using the L-BFGS optimizer and sweep over learning rates from 1e-4 to 2.0.³ After training the PINN, we measure the $L_2$ relative and absolute errors between the PINN’s predicted solution and the analytical solution. The $L_2$ relative error is
+
+$$\frac{1}{N} \sum_{i=0}^{N} \frac{\| \hat{u} - u \|_2}{\| u \|_2},$$
+
+and the absolute error is
+
+$$\frac{1}{N} \sum_{i=0}^{N} \| \hat{u} - u \|_2,$$
+
+where $N$ is the number of evaluation grid points, $\hat{u}$ is the predicted solution by the PINN, and $u$ is the true solution. For all cases, we run models at least ten times with different preset random seeds, and we average the relative and absolute errors in $u(x, t)$. For each loss function, $\hat{u}$ is the output of the NN and shorthand for $\hat{u} = NN(\theta, x, t)$.
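
The error formulas above can be read in more than one way once they are applied to a scalar field on a grid. A minimal sketch of one common reading follows: the relative error as a ratio of L2 norms over the whole grid, and the absolute error as the mean pointwise deviation. The function names and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
import math


def relative_l2_error(u_pred, u_true):
    """Relative L2 error ||u_pred - u_true||_2 / ||u_true||_2 over the grid."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(u_pred, u_true)))
    den = math.sqrt(sum(t ** 2 for t in u_true))
    return num / den


def mean_absolute_error(u_pred, u_true):
    """Mean pointwise absolute error (1/N) * sum_i |u_pred_i - u_true_i|."""
    return sum(abs(p - t) for p, t in zip(u_pred, u_true)) / len(u_true)


# Toy check: a prediction that is uniformly 10% too small on a sine profile.
xs = [2 * math.pi * i / 100 for i in range(100)]
u_true = [math.sin(x) for x in xs]
u_pred = [0.9 * u for u in u_true]
print(relative_l2_error(u_pred, u_true))  # ~0.1
print(mean_absolute_error(u_pred, u_true))
```

Averaging either quantity over several random seeds, as the setup describes, would then be a plain mean over runs.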
+
+> ² Note that for convection, this does not necessarily mean that the mapping from initial to final solution is also linear. This just means that the terms in the PDE are linear.
+>
+> ³ For reasons that are only partially understood, L-BFGS methods tend to perform better for existing PINN problems. While variants of stochastic gradient descent are much more popular in computer vision, natural language processing, and recommendation systems, we found that they underperform in comparison to L-BFGS.
+
+Figure 1 panels: (a) relative and absolute error in u(x, t) for different β; (b) exact solution for β = 30 over (t, x); (c) PINN solution for β = 30 over (t, x).
+
+Figure 1: Prediction error for the 1D convection problem (§3.1) when β is changed. The PINN has difficulty predicting the solution past a certain timestep, but is able to fit the boundary conditions. Additional figures for different β values can be seen in Fig. C.1.
+
+3.1 Learning convection
+
+Problem formulation. We first consider a one-dimensional convection problem, a hyperbolic PDE which is commonly used to model transport phenomena:
+
+$$\frac{\partial u}{\partial t} + \beta \frac{\partial u}{\partial x} = 0, \quad x \in \Omega, \quad t \in [0, T], \tag{5}$$
+
+$$u(x, 0) = h(x), \quad x \in \Omega.$$
+
+Here, β is the convection coefficient and h(x) is the initial condition. For constant β and periodic boundary conditions, this problem has a simple analytical solution: 
+
+$$u_{\text{analytical}}(x, t) = F^{-1}\left( F(h(x)) \, e^{-i \beta k t} \right), \tag{6}$$
+
+where $F$ is the Fourier transform, $i = \sqrt{-1}$, and $k$ denotes frequency in the Fourier domain. The general loss function for this problem (corresponding to Eq. 3) is
+
+$$\mathcal{L}(\theta) = \frac{1}{N_u} \sum_{i=1}^{N_u} \left( \hat{u} - u_0^i \right)^2 + \frac{1}{N_f} \sum_{i=1}^{N_f} \lambda_i \left( \frac{\partial \hat{u}}{\partial t} + \beta \frac{\partial \hat{u}}{\partial x} \right)^2 + \mathcal{L}_B, \tag{7}$$
+
+where $\hat{u} = NN(\theta, x, t)$ is the output of the NN, and $\mathcal{L}_B$ is the boundary loss. For periodic boundary conditions with $\Omega = [0, 2\pi)$, this loss is:
+
+$$\mathcal{L}_B = \frac{1}{N_b} \sum_{i=1}^{N_b} \left( \hat{u}(\theta, 0, t) - \hat{u}(\theta, 2\pi, t) \right)^2. \tag{8}$$
+
+We use the following simple initial and periodic boundary conditions:
+
+$$u(x, 0) = \sin(x), \qquad u(0, t) = u(2\pi, t). \tag{9}$$
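
As a worked instance of Eq. 6 (not spelled out in the text), the initial condition h(x) = sin(x) from Eq. 9 contains only the Fourier modes k = ±1, so the phase factor simply translates the sine wave:

$$u_{\text{analytical}}(x, t) = \frac{e^{i(x - \beta t)} - e^{-i(x - \beta t)}}{2i} = \sin(x - \beta t).$$

The sketch below evaluates the three terms of Eq. 7 for any candidate function and uses this closed form as a check. It is an illustration under stated assumptions: central finite differences stand in for automatic differentiation, a single weight `lam` replaces the per-point weights λ_i, and the function names and sample sizes are hypothetical rather than the authors' settings.

```python
import math
import random


def convection_loss(u_hat, beta, n_u=64, n_f=256, n_b=64, T=1.0, lam=1.0,
                    eps=1e-5, seed=0):
    """Return (initial-condition, PDE-residual, boundary) terms of Eq. 7."""
    rng = random.Random(seed)
    two_pi = 2 * math.pi

    # Initial-condition misfit against u(x, 0) = sin(x) (Eq. 9).
    xs0 = [rng.uniform(0, two_pi) for _ in range(n_u)]
    loss_ic = sum((u_hat(x, 0.0) - math.sin(x)) ** 2 for x in xs0) / n_u

    # PDE residual u_t + beta * u_x at random collocation points,
    # with central finite differences in place of autodiff (assumption).
    loss_pde = 0.0
    for _ in range(n_f):
        x, t = rng.uniform(0, two_pi), rng.uniform(0, T)
        u_t = (u_hat(x, t + eps) - u_hat(x, t - eps)) / (2 * eps)
        u_x = (u_hat(x + eps, t) - u_hat(x - eps, t)) / (2 * eps)
        loss_pde += lam * (u_t + beta * u_x) ** 2
    loss_pde /= n_f

    # Periodic boundary misfit (Eq. 8).
    ts = [rng.uniform(0, T) for _ in range(n_b)]
    loss_bc = sum((u_hat(0.0, t) - u_hat(two_pi, t)) ** 2 for t in ts) / n_b

    return loss_ic, loss_pde, loss_bc


beta = 30.0


def exact(x, t):
    """Closed-form solution for h(x) = sin(x)."""
    return math.sin(x - beta * t)


def frozen(x, t):
    """Matches the initial and boundary conditions but ignores transport."""
    return math.sin(x)


print(convection_loss(exact, beta))   # all three terms close to zero
print(convection_loss(frozen, beta))  # IC and BC terms ~0, PDE term large
```

The `frozen` candidate leaves a residual of β cos(x), so its PDE term averages about β²/2: the penalty for this kind of wrong answer grows quickly with β even though the initial and boundary terms are satisfied.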
+
+Observations. We apply the PINN’s soft regularization to this problem, and we optimize the loss function in Eq. 7. After training, we measure the relative and absolute errors between the PINN’s predicted solution and the analytical solution, as reported in Fig. 1(a). As one can see, the PINN is only able to achieve good solutions for small values of convection coefficient, and it fails when β becomes larger, reaching a relative error of almost 100% for β > 10. We also provide visualization of the exact and PINN solution in Fig. 1(b-c). One can clearly see that the PINN is unable to learn the solution. As we will later show, the NN architecture does have enough capacity to find the solution, but the training/optimization problem is very difficult to solve with PINNs (and importantly, it may require extensive hyperparameter tuning which is often not feasible in practice).
+
+3.2 Learning reaction-diffusion
+
+Figure 2 panels: (a) exact solution for ρ = 5, ν = 5 over (t, x); (b) PINN solution for ρ = 5, ν = 5 over (t, x).
+
+Figure 2: Prediction error for the 1D reaction-diffusion problem (§3.2). We can clearly see that the PINN has difficulty predicting the solution (especially the “sharpness” of the solution) and is unable to capture the correct behavior. Additional figures for different ν values can be seen in Fig. D.1.
+
+Problem formulation. We next look at a reaction-diffusion system, where we add a diffusion operator to the reaction equation discussed above. Note that for pure diffusion, the solution dissipates to a steady-state of uniform/constant distribution, which may be trivial to learn. Therefore, we consider studying the reaction-diffusion system:
+
+$$\frac{\partial u}{\partial t} - \nu \frac{\partial^2 u}{\partial x^2} - \rho u (1 - u) = 0, \quad x \in \Omega, \quad t \in (0, T], \tag{10}$$
+
+$$u(x, 0) = h(x), \quad x \in \Omega.$$
+
+Here, $\nu$ ($\nu > 0$) is the diffusion coefficient. The solution of such a system can be computed via Strang splitting, i.e., splitting the equation into two separate models (a reaction component and a diffusion component):
+
+$$\frac{du}{dt} = \rho u (1 - u), \qquad \frac{du}{dt} = \nu \frac{\partial^2 u}{\partial x^2}. \tag{11}$$
+
+For each timestep, we can solve the reaction equation through Eq. 15 (in §A). The diffusion equation has the following analytical solution:
+
+$$u_{\text{analytical}}(x, t) = F^{-1}\left( F(u(x, t = t_n)) \, e^{-\nu k^2 t} \right), \tag{12}$$
+
+where $u(x, t = t_n)$ is the solution at the $n$th time step. We solve the reaction equation for each timestep, and then use the reaction solution as the initial condition to solve the diffusion component and get the final solution. The general loss function for this problem is,
+
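+$$\mathcal{L}(\theta) = \frac{1}{N_u} \sum_{i=1}^{N_u} \left( \hat{u} - u_0^i \right)^2 + \frac{1}{N_f} \sum_{i=1}^{N_f} \lambda_i \left( \frac{\partial \hat{u}}{\partial t} - \nu \frac{\partial^2 \hat{u}}{\partial x^2} - \rho \hat{u} (1 - \hat{u}) \right)^2 + \mathcal{L}_B, \tag{13}$$
+
+where $\mathcal{L}_B$ is the boundary loss. Similar to the previous example, periodic boundary conditions can be enforced by including $\mathcal{L}_B$ from Eq. 8 as an extra term in the loss.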
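
The reference solution described earlier in this section (a reaction sub-step followed by a spectral diffusion sub-step, Eqs. 11 and 12) can be sketched without external libraries. Several parts are assumptions: the closed-form logistic update is the standard solution of du/dt = ρu(1 − u) and is presumed to match Eq. 15, which is not reproduced here; `t` in Eq. 12 is taken to be the length of one step; and a naive O(n²) discrete Fourier transform replaces an FFT purely to keep the example dependency-free. All function names are hypothetical.

```python
import cmath
import math


def reaction_step(u, rho, dt):
    """Closed-form logistic update for du/dt = rho * u * (1 - u)."""
    e = math.exp(rho * dt)
    return [v * e / (v * e + 1.0 - v) for v in u]


def diffusion_step(u, nu, dt):
    """Spectral update of Eq. 12 on a uniform periodic grid over [0, 2*pi)."""
    n = len(u)
    # Forward DFT (naive; an FFT would be used in practice).
    coeffs = [sum(u[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n)) / n
              for k in range(n)]
    out = []
    for j in range(n):
        total = 0.0
        for k in range(n):
            freq = k if k <= n // 2 else k - n  # signed integer wavenumber
            decay = math.exp(-nu * freq ** 2 * dt)
            total += (coeffs[k] * decay * cmath.exp(2j * math.pi * k * j / n)).real
        out.append(total)
    return out


def reference_solution(h, nu, rho, T=1.0, n_steps=100):
    """March from u(x, 0) = h to time T, applying reaction then diffusion."""
    u, dt = list(h), T / n_steps
    for _ in range(n_steps):
        u = diffusion_step(reaction_step(u, rho, dt), nu, dt)
    return u


n = 32
grid = [2 * math.pi * j / n for j in range(n)]
h = [0.5 + 0.25 * math.sin(x) for x in grid]  # illustrative initial condition only
u_final = reference_solution(h, nu=5.0, rho=5.0, T=1.0, n_steps=50)
```

Applying the sub-steps once each per timestep, in the order the text gives, is the simplest splitting; a symmetric variant with half-steps would be a small change to `reference_solution`.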
+
+Observations. Similar to the previous case, we can see that the PINN also fails to learn reaction-diffusion. We illustrate a case in Fig. 2 with ρ = 5 and ν = 5. The PINN achieves a high relative error of 93%. Here, we can clearly see that the PINN is unable to capture either the reaction or diffusion component. Additional figures for different ν values can be seen in Fig. D.1. In particular, for ν = 2 the PINN achieves a relative error of 50%. Here, we see that it is unable to capture the “sharper” transitions, though it can predict the center of the solution a little better.
+
+4 Diagnosing possible failure modes for physics-informed NNs 
+
+Thus far, we have shown that PINNs can result in high errors even for simple physical regimes, in particular for PDEs/ODEs with non-trivial convection/reaction/diffusion coefficients. Here, we demonstrate that one of the underlying reasons for this is the PDE-based soft constraint $\mathcal{L}_{\mathcal{F}}$, which makes the loss landscape difficult to optimize. We first (in §4.1) analyze the loss landscape to illustrate how increasing this soft regularization can lead to more complex loss landscapes, thus leading to optimization difficulties. We then (in §B) demonstrate how this is related to regularizing with differential operators, which can result in ill-conditioning.
+
+Figure 3 panels: loss landscape surfaces for (a) β = 1, (b) β = 10, (c) β = 20, (d) β = 30, (e) β = 40, each plotted over perturbations from −1.0 to 1.0 along two directions.
+
+| β | 1 | 10 | 20 | 30 | 40 |
+| --- | --- | --- | --- | --- | --- |
+| Relative error | $7.84 \times 10^{-3}$ | $1.08 \times 10^{-2}$ | $7.50 \times 10^{-1}$ | $8.97 \times 10^{-1}$ | $9.61 \times 10^{-1}$ |
+| Absolute error | $3.17 \times 10^{-3}$ | $6.03 \times 10^{-3}$ | $4.32 \times 10^{-1}$ | $5.42 \times 10^{-1}$ | $5.82 \times 10^{-1}$ |
+
+Figure 3: Loss landscapes for varying values of β, for the 1D convection example in §3.1. The loss landscape is more smooth at low β, and it becomes increasingly more complex as β increases, which can make the optimization problem more difficult. In particular, at higher β, the optimizer gets stuck in a certain regime. These results support that adding the PDE soft regularization term results in a more complex optimization loss landscape.
+
+4.1 Soft PDE regularization and optimization difficulties 
+
+Here, we analyze how the loss landscape changes for different regimes for the convection problem in §3.1 with/without the soft regularization in PINNs. We show that adding the soft regularization can actually make the problem harder to optimize, i.e., the regularization leads to less smooth loss landscapes. For all the experiments, we plot the loss landscape by perturbing the (trained) model across the first two dominant Hessian eigenvectors and computing the corresponding loss values. This tends to be more informative than perturbing the model parameters in random directions [41, 42]. Figure 3 shows the loss landscape for the convection problem (discussed in §3.1), for different β values. Interestingly, the loss landscape at a relatively low β = 1 is rather smooth, but increasing β further results in a complex and non-symmetric loss landscape. It is also evident that the optimizer has gotten stuck in a local minimum with a very high loss value for large β values. Finally, we study the impact of changing the weight/multiplier for the soft regularization term (i.e., the λ parameter in Eq. 3), which can be relevant in improving PINN performance [35]. While we find that tuning λ can help change the error, it cannot resolve the problem, as shown in Fig. E.1. Note that as the regularization parameter is increased, the loss landscape becomes increasingly more complex and harder to optimize (additionally, see the z-axis scale).
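
The plotting procedure described above (perturb the trained parameters along two directions and record the loss) can be sketched as follows. This is a minimal illustration: the two directions are assumed to be supplied already, whereas in the setting described here they would be the top two Hessian eigenvectors, whose computation is not shown. The helper name, grid size, and toy loss are all hypothetical.

```python
def loss_surface(loss_fn, theta, d1, d2, span=1.0, n=21):
    """Evaluate loss_fn at theta + a*d1 + b*d2 for a, b on an n-by-n grid in [-span, span]."""
    steps = [-span + 2 * span * i / (n - 1) for i in range(n)]
    surface = []
    for a in steps:
        row = []
        for b in steps:
            perturbed = [t + a * u + b * v for t, u, v in zip(theta, d1, d2)]
            row.append(loss_fn(perturbed))
        surface.append(row)
    return steps, surface


# Toy stand-in for a trained model: a quadratic loss whose Hessian is
# diag(100, 1, 0.01), so its two dominant eigenvectors are the first two axes.
def toy_loss(theta):
    return 0.5 * (100 * theta[0] ** 2 + theta[1] ** 2 + 0.01 * theta[2] ** 2)


steps, surface = loss_surface(toy_loss, [0.0, 0.0, 0.0], [1, 0, 0], [0, 1, 0], span=1.0, n=5)
print(surface[2][2])  # loss at the unperturbed parameters
```

The resulting `surface` is what a 3D surface or contour plot would display; comparing the value range across coefficient settings corresponds to the z-axis scale the text points to.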
+
+5 Expressivity versus optimization difficulty 
+
+In this section, we first show that the failure modes we observed are not necessarily due to the specific NN architecture that we used in our experiments. In particular, we show that the NN model does have the expressivity/capacity to learn the convection/reaction/diffusion coefficient cases where the vanilla PINN method fails. Additionally, in the process of demonstrating this, we also describe two methods that lead to significantly lower error rates. In particular, we show that changing the learning paradigm to curriculum regularization can make the optimization problem easier to solve (as discussed in §5.1). Second, we show that posing the problem as sequence-to-sequence learning may lead to better results than learning the entire state-space at once (as discussed in §5.2). 
+
+5.1 Curriculum PINN Regularization 
+
+One may contend that the failure modes shown in §3 may be because the NN does not have enough capacity. Here, we show that this is not the underlying reason. To do so, we devise a “curriculum regularization” method to warm start the NN training by finding a good initialization for the weights. Instead of training the PINN to learn the solution right away for cases with higher β/ρ, we start by training the PINN on lower β/ρ (easier for the PINN to learn) and then gradually move to training the PINN on higher β/ρ, respectively. We test these results for the examples in §3.1 and §A. This is somewhat analogous to curriculum learning in ML [1], but applied by progressively making the PDE/ODE harder to solve.
+
+Figure 4 panels: (a) curriculum regularization schematic (β from 5 to 30 versus training duration, regular training versus curriculum training); (b) regular training PINN solution for β = 30 over (t, x); (c) curriculum training PINN solution for β = 30 over (t, x).
+
+Figure 4: Schematic outlining curriculum regularization and example result for 1D convection from §3.1. The training procedure for regular PINN training versus curriculum PINN training for the convection example in §3.1. The regular PINN training only involves training at β = 30, while curriculum regularization starts at a lower β, trains a model, and then uses the weights of this model to reinitialize the NN for training the next β. The curriculum training approach is able to do significantly better (by almost two orders of magnitude).
+
+| Problem | Metric | Regular PINN | Curriculum training |
+| --- | --- | --- | --- |
+| 1D convection: β = 20 | Relative error | $7.50 \times 10^{-1}$ | $9.84 \times 10^{-3}$ |
+| | Absolute error | $4.32 \times 10^{-1}$ | $5.42 \times 10^{-3}$ |
+| 1D convection: β = 30 | Relative error | $8.97 \times 10^{-1}$ | $2.02 \times 10^{-2}$ |
+| | Absolute error | $5.42 \times 10^{-1}$ | $1.10 \times 10^{-2}$ |
+| 1D convection: β = 40 | Relative error | $9.61 \times 10^{-1}$ | $5.33 \times 10^{-2}$ |
+| | Absolute error | $5.82 \times 10^{-1}$ | $2.69 \times 10^{-2}$ |
+
+Table 1: Training the PINN gradually on more difficult problems improves performance. 1D convection example in §3.1. The curriculum training approach achieves significantly better errors.
+
+Figure 4 shows the training procedure for an example convection case (§3.1) with β = 30. As Fig. 4(c) shows, the curriculum regularization approach results in a much more accurate solution than regular PINN training. With curriculum regularization, the relative error is almost two orders of magnitude lower. Additionally, this is true across all the other regimes in which we found regular PINNs to fail, as shown in Tab. 1. In Fig. E.2, we also show that curriculum regularization not only decreases error significantly, but also decreases the variance of the error. In Fig. E.3, we see that curriculum regularization results in a much smoother loss landscape as compared to regular PINN training. Curriculum regularization also works well for the reaction example in §A. In this case, we start by training with a low ρ value (reaction coefficient), and then increase gradually to higher ρ values. The results can be seen in Fig. E.4. We can see that the error is 0.1 to 0.6 orders of magnitude lower for ρ from 2 to 4 (when the regular PINN error is not as high), and that it decreases by 1-2 orders of magnitude for ρ from 5 to 10. As we discussed before, the PINN has difficulty in learning sharp features for high values of ρ. However, the curriculum regularization overcomes this, even for ρ = 10, as seen in Fig. E.4(c).
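
The warm-start loop behind curriculum regularization can be sketched in a few lines. The sketch assumes a linear schedule of coefficient values and a user-supplied `train_fn(params, coeff)` that minimizes the PINN loss for one coefficient and returns the updated weights; both are hypothetical, since the exact increments and training routine are not given in this section.

```python
def curriculum_train(init_params, train_fn, target_coeff, n_stages=6, start_coeff=None):
    """Train through increasing PDE coefficients, warm-starting each stage.

    train_fn(params, coeff) -> params is a hypothetical routine that fits the
    PINN for a single coefficient value (e.g. beta or rho).
    """
    if start_coeff is None:
        start_coeff = target_coeff / n_stages
    step = (target_coeff - start_coeff) / max(n_stages - 1, 1)
    schedule = [start_coeff + i * step for i in range(n_stages)]
    params = init_params
    for coeff in schedule:
        # Reuse the weights learned on the easier problem as the initialization.
        params = train_fn(params, coeff)
    return params, schedule


# Stand-in trainer that only records which coefficient it was asked to fit.
log = []


def fake_train(params, coeff):
    log.append(coeff)
    return params + 1


params, schedule = curriculum_train(0, fake_train, target_coeff=30.0)
print(schedule)  # [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
```

The essential design choice is that `params` is never reset between stages; only the coefficient inside the PDE residual changes.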
+
+5.2 Sequence-to-sequence learning vs learning the entire space-time solution 
+
+The original PINN approach of [28] trains the NN model to predict the entire space-time at once (i.e., predict u for all locations and time points). In certain cases, this can be more difficult to learn. Here, we demonstrate that it may be better to pose the problem as a sequence-to-sequence (seq2seq) learning task, where the NN learns to predict the solution at the next time step, instead of all times.
+
+Figure 5 panels: (a) regular PINN training over the full (t, x) domain; (b) sequence-to-sequence learning (model trained every Δt), shown over successive time segments. Legend: initial condition points, boundary points, collocation points.
+
+Figure 5: Schematic outlining seq2seq learning. In contrast to regular PINN training, the solution in seq2seq learning is predicted for only one Δt step at a time. Then, the predicted solution at t = Δt is used as the initial condition for the next segment. To allow fair comparison, we keep the total number of collocation points to be exactly the same in either approach. That is, we do not increase the number of collocation points for seq2seq learning in the right, and keep it to be the same as in the corresponding segment in the left figure.
+
+| Coefficients | Metric | Entire state space | Δt = 0.05 | Δt = 0.1 |
+| --- | --- | --- | --- | --- |
+| ν = 2, ρ = 5 | Relative error | $5.07 \times 10^{-1}$ | $2.04 \times 10^{-2}$ | $1.18 \times 10^{-2}$ |
+| | Absolute error | $2.70 \times 10^{-1}$ | $1.06 \times 10^{-2}$ | $6.41 \times 10^{-3}$ |
+| ν = 3, ρ = 5 | Relative error | $7.98 \times 10^{-1}$ | $1.92 \times 10^{-2}$ | $1.56 \times 10^{-2}$ |
+| | Absolute error | $4.79 \times 10^{-1}$ | $1.01 \times 10^{-2}$ | $8.17 \times 10^{-3}$ |
+| ν = 4, ρ = 5 | Relative error | $8.84 \times 10^{-1}$ | $2.37 \times 10^{-2}$ | $1.59 \times 10^{-2}$ |
+| | Absolute error | $5.74 \times 10^{-1}$ | $1.15 \times 10^{-2}$ | $8.01 \times 10^{-3}$ |
+| ν = 5, ρ = 5 | Relative error | $9.35 \times 10^{-1}$ | $2.36 \times 10^{-2}$ | $2.39 \times 10^{-2}$ |
+| | Absolute error | $6.46 \times 10^{-1}$ | $1.09 \times 10^{-2}$ | $1.15 \times 10^{-2}$ |
+| ν = 6, ρ = 5 | Relative error | $9.60 \times 10^{-1}$ | $2.81 \times 10^{-2}$ | $2.69 \times 10^{-2}$ |
+| | Absolute error | $6.84 \times 10^{-1}$ | $1.17 \times 10^{-2}$ | $1.28 \times 10^{-2}$ |
+
+Table 2: Predicting the entire state space versus discretizing the state space (i.e., seq2seq learning) for 1D reaction-diffusion (§3.2). The seq2seq learning achieves lower error for both Δt = 0.05 and Δt = 0.1, in comparison to the PINN’s approach of predicting the entire state space at once.
+
+This way, we can use a marching-in-time scheme to predict different sequences/time points. Note that the only data available here is from the PDE itself, i.e., just the initial condition. We take the prediction at t = Δt and use this as the initial condition to make a prediction at t = 2Δt, and so on. This is schematically outlined in Fig. 5. We test this scheme by using the exact same NN architecture as in previous sections, and we report the results in Tab. E.1 for the convection problem of §3.1, Tab. E.2 for the reaction problem of §A, and Tab. 2 for the reaction-diffusion problem of §3.2. We compare the relative/absolute error when the learning is posed as a seq2seq problem (i.e., predicting the state space with a “time marching scheme” of one timestep prediction at a time) to the PINN approach of predicting the whole state space at once.⁴
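
The marching-in-time scheme, together with the fixed collocation budget used for the fair comparison, can be sketched as below. `train_segment` is a hypothetical routine that fits a PINN on one time window and returns the model plus its predicted state at the window's end; the helper names are assumptions for illustration.

```python
def time_segments(T, dt, n_collocation):
    """Split [0, T] into windows of length dt with an equal share of collocation points."""
    n_seg = round(T / dt)
    per_seg = n_collocation // n_seg
    return [(i * dt, min((i + 1) * dt, T), per_seg) for i in range(n_seg)]


def march_in_time(initial_condition, segments, train_segment):
    """Train one window at a time, passing each window's end state to the next.

    train_segment(ic, t_start, t_end, n_points) -> (model, end_state) is a
    hypothetical user-supplied routine.
    """
    models, ic = [], initial_condition
    for t_start, t_end, n_points in segments:
        model, ic = train_segment(ic, t_start, t_end, n_points)
        models.append(model)
    return models


segments = time_segments(T=1.0, dt=0.1, n_collocation=1000)
print(len(segments), segments[0][2])  # 10 windows, 100 points each
```

With a time horizon of 1 and 1000 collocation points in total, Δt = 0.1 gives 100 points per window, so the segmented run sees no more collocation points than the single full-domain run.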
+
+> ⁴ To have a fair comparison between the two methods (time marching versus predicting the entire state space at once), we use the same total number of collocation (interior) points for both. For example, for T = [0, 1], if we use 1000 collocation points to predict the entire state space, then for Δt = 0.1 we use 100 collocation points per section.
+
+We explore the following cases where the PINN does poorly, varying β, ρ, and ν coefficients:
+
+1) For 1D convection (§3.1), higher β values from 30-40.
+2) For 1D reaction (§A), ρ coefficients from 5-10.
+3) For 1D reaction-diffusion (§3.2), a fixed ρ = 5 and ν coefficients from 2-6.
+
+For these cases, we find that posing the problem as seq2seq learning results in significantly lower error. The difference is particularly striking for the reaction and reaction-diffusion cases, where the seq2seq PINN model decreases error by almost two orders of magnitude. An example case is shown in Fig. 6, where the seq2seq approach is able to recover the solution, while the regular PINN does very poorly. Note that this behavior also has analogues with numerical methods used in scientific computing, where space-time problems are typically harder to solve, as compared to time marching methods [8]. Intuitively, since the problem is ill-conditioned, restricting the dimensions is expected to help. Furthermore, the underlying function/mapping of the input to the solution should be much simpler to approximate over a smaller time span, as compared to the full time horizon. These initial results are promising, and further developments may lead to still better ways of using PINNs and learning PDEs. In particular, using more sophisticated methods to predict timesteps across the state space may provide improved performance, as may including more sophisticated seq2seq approaches and tuning the regularization parameter (i.e., amount of constraint added).
+
+Figure 6 panels: (a) exact solution for ρ = 5, ν = 3 over (t, x); (b) regular PINN prediction for ρ = 5, ν = 3; (c) seq2seq PINN prediction for ρ = 5, ν = 3.
+
+Figure 6: Predicting the entire state space vs seq2seq learning for 1D reaction-diffusion. The regular PINN is unable to capture the “sharp” and/or diffusive features correctly. However, the seq2seq learning approach is able to capture the correct solution, and achieves almost two orders of magnitude lower error.
+
+6 Conclusions 
+
+PINNs, and SciML more generally, hold great promise for expanding the scope of ML methodology to important problems in science and engineering. For these problems, however, integrating ML methods with PDE-based domain-driven constraints as a soft regularization term can lead to subtle and critical issues. In particular, we show that this approach can have fundamental limitations which result in failure modes for learning relevant physics commonly used in different fields of science. To show this, we picked two fundamental PDE problems of diffusion and convection and showed that the PINN only works for very simple cases, failing to learn the relevant physical phenomena for even moderately more challenging regimes. We then analyzed the problem to characterize the underlying reasons why these failures occur. In particular, we studied the PINN loss landscape behavior and found it becomes increasingly complex for large values of diffusion or convection coefficients, and with/without non-homogeneous forcing. We also discussed that the problem is not necessarily due to the limited capacity of the NN, but that it is partly an optimization problem resulting from the PDE-based soft constraint used in PINNs. Furthermore, we showed that the PINN approach of solving for the entire space-time at once may not be efficient, and instead posing the problem as a sequence-to-sequence learning task can provide lower error rates. Addressing these and related issues will be critical if we hope to go beyond existing cut-and-paste approaches, toward engineering a more intimate connection between scientific methodologies and ML methodologies. This will be needed to deliver on the promise of PINNs and SciML more generally.
+
+7 Acknowledgements
+
+We are thankful to Shashank Subramanian for his feedback and contributions. We also acknowledge helpful discussions with Prof. George Biros, Geoffrey Negiar, and Daniel Rothchild. ASK was supported by Laboratory Directed Research and Development (LDRD) funding under Contract Number DE-AC02-05CH11231 at LBNL and the Alvarez Fellowship in the Computational Research Division at LBNL. AG was supported through funding from Samsung SAIT. MWM would also like to acknowledge the UC Berkeley CLTC, ARO, NSF, and ONR. The UC Berkeley team also acknowledges gracious support from Intel corporation, Intel VLAB, Samsung, Amazon AWS, Google Cloud, Google TPU Research Cloud, and Google Brain (in particular Prof. David Patterson, Dr. Ed Chi, and Jing Li). Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. 
+
+References 
+
+[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
+
+[2] S. L. Brunton, B. R. Noack, and P. Koumoutsakos. Machine learning for fluid mechanics. Annual Review of Fluid Mechanics, 52:477–508, 2020.
+
+[3] Y. Chen, L. Lu, G. E. Karniadakis, and L. Dal Negro. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Optics express, 28(8):11618–11633, 2020.
+
+[4] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho. Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020.
+
+[5] M. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. Communications in Numerical Methods in Engineering, 10(3):195–201, 1994.
+
+[6] P. L. Donti, D. Rolnick, and J. Z. Kolter. Dc3: A learning method for optimization with hard constraints. arXiv preprint arXiv:2104.12225, 2021.
+
+[7] V. Dwivedi, N. Parashar, and B. Srinivasan. Distributed learning machines for solving forward and inverse problems in partial differential equations. Neurocomputing, 420:299–316, 2021.
+
+[8] K. Eriksson, D. Estep, P. Hansbo, and C. Johnson. Computational differential equations. Cambridge University Press, 1996.
+
+[9] B. Fornberg. A practical guide to pseudospectral methods. Cambridge university press, 1998.
+
+[10] O. Fuks and H. A. Tchelepi. Limitations of physics informed machine learning for nonlinear two-phase transport in porous media. Journal of Machine Learning for Modeling and Computing, 1(1), 2020.
+
+[11] N. Geneva and N. Zabaras. Modeling the dynamics of pde systems with physics-constrained deep auto-regressive networks. Journal of Computational Physics, 403:109056, 2020.
+
+[12] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. Advances in Neural Information Processing Systems, 32:15379–15389, 2019.
+
+[13] J. Han, A. Jentzen, and E. Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
+
+[14] O. Hennigh, S. Narasimhan, M. A. Nabian, A. Subramaniam, K. Tangsali, Z. Fang, M. Rietmann, W. Byeon, and S. Choudhry. Nvidia simnet™: An ai-accelerated multi-physics simulation framework. In International Conference on Computational Science, pages 447–461. Springer, 2021.
+
+[15] W. Ji, W. Qiu, Z. Shi, S. Pan, and S. Deng. Stiff-pinn: Physics-informed neural network for stiff chemical kinetics. arXiv preprint arXiv:2011.04520, 2020.
+
+[16] X. Jin, S. Cai, H. Li, and G. E. Karniadakis. Nsfnets (navier-stokes flow nets): Physics-informed neural networks for the incompressible navier-stokes equations. Journal of Computational Physics, 426:109951, 2021.
+
+[17] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
+
+[18] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
+
+[19] Z. Long, Y. Lu, X. Ma, and B. Dong. Pde-net: Learning pdes from data. In International Conference on Machine Learning, pages 3208–3216. PMLR, 2018.
+
+[20] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis. Deepxde: A deep learning library for solving differential equations. SIAM Review, 63(1):208–228, 2021.
+
+[21] L. Lu, R. Pestourie, W. Yao, Z. Wang, F. Verdugo, and S. G. Johnson. Physics-informed neural networks with hard constraints for inverse design. arXiv preprint arXiv:2102.04626, 2021.
+
+[22] P. Márquez-Neila, M. Salzmann, and P. Fua. Imposing hard constraints on deep networks: Promises and limitations. arXiv preprint arXiv:1706.02025, 2017.
+
+[23] P. Moin. Fundamentals of engineering numerical analysis. Cambridge University Press, 2010.
+
+[24] Y. Nandwani, A. Pathak, P. Singla, et al. A primal dual formulation for deep learning with constraints. Advances in Neural Information Processing Systems, 2019.
+
+[25] D. R. Parisi, M. C. Mariani, and M. A. Laborde. Solving differential equations with unsupervised neural networks. Chemical Engineering and Processing: Process Intensification, 42(8-9):715–721, 2003.
+
+[26] Characterizing possible failure modes in physics-informed neural networks. https://github.com/a1k12/characterizing-pinns-failure-modes, 2021.
+
+[27] C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner, A. Ramadhan, and A. Edelman. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020.
+
+[28] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
+
+[29] M. Raissi, A. Yazdani, and G. E. Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.
+
+[30] F. Sahli Costabal, Y. Yang, P. Perdikaris, D. E. Hurtado, and E. Kuhl. Physics-informed neural networks for cardiac activation mapping. Frontiers in Physics, 8:42, 2020.
+
+[31] J. Sirignano and K. Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential equations. Journal of computational physics, 375:1339–1364, 2018.
+
+[32] B. P. van Milligen, V. Tribaldos, and J. Jiménez. Neural network differential equation and plasma equilibrium solver. Physical review letters, 75(20):3594, 1995.
+
+[33] L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, et al. Informed machine learning–a taxonomy and survey of integrating knowledge into learning systems. arXiv preprint arXiv:1903.12394, 2019.
+
+[34] B. Wang, W. Zhang, and W. Cai. Multi-scale deep neural network (mscalednn) methods for oscillatory stokes flows in complex domains. arXiv preprint arXiv:2009.12729, 2020.
+
+[35] S. Wang, Y. Teng, and P. Perdikaris. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536, 2020.
+
+[36] S. Wang, H. Wang, and P. Perdikaris. On the eigenvector bias of fourier feature networks: From regression to solving multi-scale pdes with physics-informed neural networks. arXiv preprint arXiv:2012.10047, 2020.
+
+[37] S. Wang, X. Yu, and P. Perdikaris. When and why pinns fail to train: A neural tangent kernel perspective. arXiv preprint arXiv:2007.14527, 2020.
+
+[38] E. Weinan, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
+
+[39] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919, 2020.
+
+[40] K. Xu and E. Darve. Physics constrained learning for data-driven inverse modeling from sparse observations. arXiv preprint arXiv:2002.10521, 2020.
+
+[41] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. Advances in Neural Information Processing Systems, 2018.
+
+[42] Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney. Pyhessian: Neural networks through the lens of the hessian. In 2020 IEEE International Conference on Big Data (Big Data), pages 581–590. IEEE, 2020.
+
+[43] Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, and P. Perdikaris. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81, 2019.
+
+[44] O. C. Zienkiewicz, R. L. Taylor, P. Nithiarasu, and J. Zhu. The finite element method, volume 3. McGraw-hill London, 1977.
+
+[Figure A.1 panels: (a) Error for different ρ (relative and absolute error in u(x, t), log-scale axes); (b) Exact solution for ρ = 5 (heatmap, x vs. t); (c) PINN predicted solution for ρ = 5 (heatmap, x vs. t).]
+
+Figure A.1: Prediction error for 1D reaction ( §A) problem, when ρ is changed. The PINN has difficulty predicting any part of the solution (and can’t even predict the first few timesteps), and only predicts a homogeneous solution everywhere. 
+
+A Learning reaction 
+
+We include an additional example studying the reaction case, to complement the additional studies in §3. 
+
+Problem formulation. We consider an example of a one-dimensional reaction equation, which is commonly used to model chemical reactions. We look at the reaction term in Fisher’s equation, which is a semi-linear ordinary differential equation: 
+
+$$\frac{\partial u}{\partial t} - \rho u(1 - u) = 0, \quad x \in \Omega,\ t \in (0, T], \tag{14}$$
+
+$$u(x, 0) = h(x), \quad x \in \Omega.$$
+
+Here, ρ is the reaction coefficient and h(x) is the initial condition. The reaction problem has a simple analytical solution for periodic boundary conditions and constant ρ:
+
+$$u_{\text{analytical}}(x, t) = \frac{h(x)\, e^{\rho t}}{h(x)\, e^{\rho t} + 1 - h(x)}. \tag{15}$$
+
+The general loss function for this problem is
+
+$$L(\theta) = \frac{1}{N_u} \sum_{i=1}^{N_u} \left( \hat{u} - u_0^i \right)^2 + \frac{1}{N_f} \sum_{i=1}^{N_f} \lambda_i \left( \frac{\partial \hat{u}}{\partial t} - \rho \hat{u}(1 - \hat{u}) \right)^2 + L_B, \tag{16}$$
+
+where $L_B$ is the boundary loss (same as in Eq. 8). We consider the following initial and boundary conditions:
+
+$$u(x, 0) = \exp\!\left( -\frac{(x - \pi)^2}{2 (\pi/4)^2} \right), \qquad u(0, t) = u(2\pi, t). \tag{17}$$
+
+Observations. We report the relative/absolute error of the PINN with respect to the ground truth in Fig. A.1. Similar to the convection case, we can see that the PINN is only able to learn the problem for very small values of the reaction coefficient, ρ. However, the error quickly gets to 100% as ρ is increased. The example heatmap in Fig. A.1(c) shows that the PINN is not able to predict the solution at all, and instead, it predicts a mostly homogeneous solution, close to zero, everywhere. 
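
The closed-form solution of Eq. (15) with the Gaussian initial condition of Eq. (17) is easy to evaluate directly, which gives a cheap ground truth for checking a trained model. The sketch below is illustrative only: the function names, the 256-point grid, and the particular relative L2 error definition are assumptions, not details taken from the paper's code. It also shows why a network that collapses to a near-zero prediction everywhere reports roughly 100% relative error.

```python
import math

def gaussian_ic(x):
    """Initial condition of Eq. (17): Gaussian centered at pi with width pi/4."""
    return math.exp(-((x - math.pi) ** 2) / (2.0 * (math.pi / 4.0) ** 2))

def reaction_exact(x, t, rho):
    """Analytical solution of Eq. (15) for a constant reaction coefficient rho."""
    h = gaussian_ic(x)
    growth = math.exp(rho * t)
    return h * growth / (h * growth + 1.0 - h)

def relative_l2_error(pred, exact):
    """Relative L2 error between two equally long sequences (assumed metric)."""
    num = math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exact)))
    den = math.sqrt(sum(e ** 2 for e in exact))
    return num / den

# Hypothetical evaluation grid on [0, 2*pi) at the final time t = 1.
xs = [2.0 * math.pi * i / 256 for i in range(256)]
exact_final = [reaction_exact(x, 1.0, rho=5.0) for x in xs]

# A model that predicts zero everywhere has a relative error of exactly 1 (100%).
zero_prediction = [0.0] * len(xs)
err = relative_l2_error(zero_prediction, exact_final)
```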
+
+B A PDE perspective on ill-conditioned regularization 
+
+One of the difficulties with PINNs arises from the soft regularization term that includes differential operators. This term is quite different from norm-based regularization that is more common in ML, and this can actually make the problem more ill-conditioned (or less regularized). From a PDE perspective, this is not surprising as the PDE-based regularization operator (i.e., the $L_F$ term) in the PINN can in fact be ill-conditioned, which can lead to unstable numerical behavior. For example, in numerical PDE analysis, it is well-known that the condition number for the regularization operator for the diffusion problem is $O(\nu N^2)^2$, where $N$ is the grid size (see §B.1 for the derivation). Similarly, the condition number for the convection problem scales as $O(\beta N)^2$, which is still quite high. As such, it is not surprising that this ill-conditioned property would lead to instability which can manifest itself in large gradients and/or poor convergence behavior. (Similar results were also reported in [35].) We should however emphasize that the condition number is only one of the many factors involved in this problem, and other factors such as the complexity of the function that we are trying to approximate, the non-convex loss landscape, and the limitations of optimization algorithms need to be considered. To give an example, a highly diffusive PDE with a very large diffusion coefficient quickly results in a solution that approaches uniform/constant distribution, which can be very easy to approximate with trivial initial conditions. However, as with other ill-conditioned operators, it is in the presence of noise, or cases with non-trivial physics, that the instability appears.
+
+B.1 Approximate condition number scaling for PDE regularization in PINNs 
+
+We provide a derivation to obtain an approximate condition number for the PINN regularization term. First, note that the condition number of an operator is a metric for the amount that a function’s output changes for a change in its input. Using this definition, we can obtain an approximate bound on the condition number for the PINN regularization operator. We should emphasize that the non-linear nature of the regularization makes the exact condition number behavior dependent on the state variable and its derivatives, and so the results obtained below are approximate. 
+
+Convection problem. For the convection problem, the PDE-based regularization operator is: 
+
+$$f = \frac{du}{dt} + \beta \frac{\partial u}{\partial x}. \tag{18}$$
+
+For a small perturbation $\delta u$ of the input, the output changes as follows:
+
+$$\delta f = \frac{d\,\delta u}{dt} + \beta \frac{\partial\, \delta u}{\partial x}. \tag{19}$$
+
+We can estimate that the maximum change in output (denoted as $\delta f$) is proportional to the sum of the maximum change in the first time derivative, which scales as $O(\delta t^{-1})$, and the second term, which scales as $O(\beta h^{-1})$. Here, $\delta t$ is the time step size and $h = 1/N$ is the grid spacing for a uniformly discretized grid of size $N$. Assuming that the time step size is not very small, the condition number will be proportional to the second term and will scale as $O(\beta N)$. Since the PINN regularization term uses L2 loss, this change will be quadratically scaled.
+
+Diffusion problem. Similarly, we can obtain an approximate scaling of the condition number for the diffusion case where the regularization function has the following form: 
+
+$$f = \frac{du}{dt} - \nu \frac{\partial^2 u}{\partial x^2}. \tag{20}$$
+
+The corresponding change in the output, for a perturbation in $u$, is then:
+
+$$\delta f = \frac{d\,\delta u}{dt} - \nu \frac{\partial^2 \delta u}{\partial x^2}. \tag{21}$$
+
+Similar to the previous case, we can estimate that the maximum change in output (denoted as $\delta f$) is proportional to the sum of the maximum change in the first time derivative, which again scales as $O(\delta t^{-1})$, and the second term, which scales as $O(\nu h^{-2})$. Again, assuming that the time step size is not very small, the condition number will be proportional to the second term and will scale as $O(\nu N^2)$. Similarly, since the PINN regularization term uses L2 loss, this change will be quadratically scaled. Also, we observed that error increases more rapidly when ν increases (for the diffusion case), in contrast to the slower error increase with respect to β for the convection case. Finally, as our estimates of the condition number show, the condition number for diffusion scales as $N^2$ while convection scales as $N$.
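
As a rough numerical illustration of the scaling estimates in §B.1, the sketch below evaluates the two leading-order expressions with all constants dropped. The function names and the sample values β = 30 and ν = 3 are assumptions chosen for illustration; only the growth rates with N are meaningful, not the absolute numbers.

```python
def convection_condition_scaling(beta, n):
    """Leading-order estimate O(beta * N)^2 for the squared convection residual."""
    return (beta * n) ** 2

def diffusion_condition_scaling(nu, n):
    """Leading-order estimate O(nu * N^2)^2 for the squared diffusion residual."""
    return (nu * n ** 2) ** 2

# Effect of doubling the grid size N from 100 to 200 (hypothetical values).
convection_ratio = (convection_condition_scaling(30.0, 200)
                    / convection_condition_scaling(30.0, 100))
diffusion_ratio = (diffusion_condition_scaling(3.0, 200)
                   / diffusion_condition_scaling(3.0, 100))
```

Under these estimates, doubling N multiplies the convection term by 4 and the diffusion term by 16, which matches the observation in the text that diffusion is the more severely conditioned of the two.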
+
+C Learning convection 
+
+We include additional heatmaps for learning convection (§3.1).
+
+[Figure C.1 panels (heatmaps, x vs. t), each row showing the exact solution, the PINN solution, and the difference between predicted and exact solution: (a)-(c) β = 10; (d)-(f) β = 30; (g)-(i) β = 50; (j)-(l) β = 70.]
+
+Figure C.1: Heatmap of exact vs predicted solution for 1D convection ( §3.1). Heatmap of the exact solutions to the 1D convection equation, Eq. 5, for a variety of β values and the respective PINN predicted solution. The PINN is unable to predict the solution as β increases, past a certain timestep. 
+
+D Learning reaction-diffusion
+
+We include additional heatmaps for learning reaction-diffusion (§3.2).
+
+[Figure D.1 panels (heatmaps, x vs. t), all with ρ = 5: (a) exact solution and (b) PINN solution for ν = 2; (c) exact solution and (d) PINN solution for ν = 3; (e) exact solution and (f) PINN solution for ν = 4.]
+
+Figure D.1: Heatmap of exact vs predicted solution for 1D reaction-diffusion ( §3.2). Heatmap of the exact solutions to the 1D reaction-diffusion equation, Eq. 10, for different ν values and the respective physics-informed NN predicted solution. The PINN is unable to predict the solution, including both capturing the “sharp” features and/or diffuse features. 
+
+E Extra Results
+
+E.1 Extra results for loss landscapes when varying the λ parameter
+
+[Figure E.1 panels (loss landscape surfaces): (a) λ = 1 × 10⁻⁶; (b) λ = 1 × 10⁻⁵; (c) λ = 1 × 10⁻³; (d) λ = 1 × 10⁻¹; (e) λ = 1 × 10¹.]
+
+| λ | 1 × 10⁻⁶ | 1 × 10⁻⁵ | 1 × 10⁻³ | 1 × 10⁻¹ | 1 × 10¹ |
+|---|---|---|---|---|---|
+| Relative error | 1.69 | 1.65 | 1.00 | 1.08 | 0.982 |
+| Absolute error | 0.987 | 0.987 | 0.623 | 0.647 | 0.595 |
+
+Figure E.1: Loss landscapes when varying the λ parameter in F, for the 1D convection equation in §3.1 . In this example, β = 30 , which is a point at which the error is high. The loss landscape becomes more complex as λ is increased, i.e., as the regularization term grows. However, error stays consistently high (although it decreases a little as λ is increased). 
+
+E.2 Extra curriculum regularization results 
+
+We include extra curriculum regularization results for §3.1 in Fig. E.2. These results demonstrate that curriculum regularization not only decreases the error in the solution, but also decreases the variance in the error.
+
+[Figure E.2 panels: (a) Relative error (log scale) for different β; (b) Absolute error (log scale) for different β. Each panel compares regular and curriculum training for β = 10, 20, 30, 40.]
+
+Figure E.2: Training the PINN gradually on more difficult problems improves performance. 1D convection example in §3.1. Summary of performance across 10 preset random seeds (with lowest error per seed) to show the variance in error. The curriculum learning approach achieves significantly better errors, as well as lower variance in the error.
+
+We include extra curriculum regularization results for the convection example in §3.1 in Fig. E.3, showing how the loss landscape becomes smoother with the curriculum regularization approach. We also include extra curriculum regularization results for the reaction example in §A in Fig. E.4. These results further demonstrate that curriculum regularization greatly decreases error in the solution.
+
+[Figure E.3 panels (loss landscape surfaces): (a) Regular PINN training for β = 30.0; (b) Curriculum learning training for β = 30.0.]
+
+Figure E.3: Loss landscapes for regular PINN training versus curriculum learning training. The loss landscape is much smoother for the curriculum learning approach.
+
+[Figure E.4 panels: (a) Error for regular training vs. curriculum training (relative and absolute error in u(x, t) against ρ, log-scale axes); (b) Regular training PINN solution for ρ = 10 (heatmap, x vs. t); (c) Curriculum training PINN solution for ρ = 10 (heatmap, x vs. t).]
+
+Figure E.4: Curriculum regularization for 1D reaction from §A . The PINN is now able to predict the solution much more closely, including capturing the “sharp” features (traditionally hard for PINNs), and error is 1-2 orders of magnitude lower when using curriculum training over regular training. 
+
+E.3 Extra sequence-to-sequence learning results 
+
+We include sequence-to-sequence learning results for convection (§3.1) in Tab. E.1.
+
+| Case | Metric | Entire state space | ∆t = 0.05 | ∆t = 0.1 |
+|---|---|---|---|---|
+| β = 30 | Relative error | 7.38 × 10⁻¹ | 2.13 × 10⁻¹ | 1.05 × 10⁻¹ |
+| β = 30 | Absolute error | 5.57 × 10⁻¹ | 1.29 × 10⁻¹ | 5.95 × 10⁻² |
+| β = 40 | Relative error | 8.25 × 10⁻¹ | 4.58 × 10⁻¹ | 2.41 × 10⁻¹ |
+| β = 40 | Absolute error | 6.06 × 10⁻¹ | 2.58 × 10⁻¹ | 1.35 × 10⁻¹ |
+
+Table E.1: Predicting the entire state space versus discretizing the state space (i.e., seq2seq learning) for 1D convection (§3.1). The seq2seq learning achieves lower error for both ∆t = 0.05 and ∆t = 0.1, in comparison to the PINN’s approach of predicting the entire state space at once. For this example only, we use a higher number of collocation points (10000) for regular PINNs and seq2seq learning, which minimizes the variance in the seq2seq results.
+
+We include sequence-to-sequence learning results for reaction (§A) in Tab. E.2.
+
+| Case | Metric | Entire state space | ∆t = 0.05 | ∆t = 0.1 |
+|---|---|---|---|---|
+| ρ = 5 | Relative error | 9.79 × 10⁻¹ | 7.06 × 10⁻² | 7.09 × 10⁻² |
+| ρ = 5 | Absolute error | 5.40 × 10⁻¹ | 2.52 × 10⁻² | 2.39 × 10⁻² |
+| ρ = 6 | Relative error | 9.88 × 10⁻¹ | 8.25 × 10⁻² | 7.78 × 10⁻² |
+| ρ = 6 | Absolute error | 5.88 × 10⁻¹ | 3.02 × 10⁻² | 2.65 × 10⁻² |
+| ρ = 7 | Relative error | 9.92 × 10⁻¹ | 8.16 × 10⁻² | 7.56 × 10⁻² |
+| ρ = 7 | Absolute error | 6.31 × 10⁻¹ | 3.03 × 10⁻² | 2.69 × 10⁻² |
+| ρ = 8 | Relative error | 9.94 × 10⁻¹ | 8.19 × 10⁻² | 7.44 × 10⁻² |
+| ρ = 8 | Absolute error | 6.69 × 10⁻¹ | 3.10 × 10⁻² | 2.73 × 10⁻² |
+| ρ = 9 | Relative error | 9.95 × 10⁻¹ | 7.02 × 10⁻² | 8.63 × 10⁻² |
+| ρ = 9 | Absolute error | 7.02 × 10⁻¹ | 2.83 × 10⁻² | 3.21 × 10⁻² |
+| ρ = 10 | Relative error | 9.96 × 10⁻¹ | 6.88 × 10⁻² | 7.47 × 10⁻² |
+| ρ = 10 | Absolute error | 7.31 × 10⁻¹ | 2.85 × 10⁻² | 2.85 × 10⁻² |
+
+Table E.2: Predicting the entire state space versus discretizing the state space (i.e., seq2seq learning) for 1D reaction (§A). The seq2seq learning achieves lower error for both ∆t = 0.05 and ∆t = 0.1, in comparison to the PINN’s approach of predicting the entire state space at once.
+
diff --git a/pinn/docs/evidence/sukumar2022_exact_bc_distance.md b/pinn/docs/evidence/sukumar2022_exact_bc_distance.md
new file mode 100644
index 0000000..70e5843
--- /dev/null
+++ b/pinn/docs/evidence/sukumar2022_exact_bc_distance.md
@@ -0,0 +1,2541 @@
+Title: Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks
+
+URL Source: https://arxiv.org/pdf/2104.08426
+
+Published Time: Mon, 23 Jan 2023 09:31:57 GMT
+
+Number of Pages: 50
+
+Markdown Content:
+# Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks 
+
+N. Sukumar a, ∗, Ankit Srivastava b
+
+> aDepartment of Civil and Environmental Engineering, University of California, Davis, CA 95616, USA
+> bDepartment of Mechanical, Materials, and Aerospace Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
+
+Abstract 
+
+In this paper, we introduce a new approach based on distance fields to exactly impose boundary conditions in physics-informed deep neural networks. The challenges in satisfying Dirichlet boundary conditions in meshfree and particle methods are well-known. This issue is also pertinent in the development of physics informed neural networks (PINN) for the solution of partial differential equations. We introduce geometry-aware trial functions in artificial neural networks to improve the training in deep learning for partial differential equations. To this end, we use concepts from constructive solid geometry (R-functions) and generalized barycentric coordinates (mean value potential fields) to construct φ(x), an approximate distance function to the boundary of a domain in $\mathbb{R}^d$. To exactly impose homogeneous Dirichlet boundary conditions, the trial function is taken as φ(x) multiplied by the PINN approximation, and its generalization via transfinite interpolation is used to a priori satisfy inhomogeneous Dirichlet (essential), Neumann (natural), and Robin boundary conditions on complex geometries. In doing so, we eliminate modeling error associated with the satisfaction of boundary conditions in a collocation method and ensure that kinematic admissibility is met pointwise in a Ritz method. With this new ansatz, the training for the neural network is simplified: the sole contribution to the loss function is from the residual error at interior collocation points where the governing equation is required to be satisfied. Numerical solutions are computed using strong form collocation and Ritz minimization. To convey the main ideas and to assess the accuracy of the approach, we present numerical solutions for linear and nonlinear boundary-value problems over convex and nonconvex polygonal domains as well as over domains with curved boundaries. Benchmark problems in one dimension for linear elasticity, advection-diffusion, and beam bending; and in two dimensions for the steady-state heat equation, Laplace equation, biharmonic equation (Kirchhoff plate bending), and the nonlinear Eikonal equation are considered. The construction of approximate distance functions using R-functions extends to higher dimensions, and we showcase its use by solving a Poisson problem with homogeneous Dirichlet boundary conditions over the four-dimensional hypercube. The proposed approach consistently outperforms a standard PINN-based collocation method, which underscores the importance of exactly (a priori) satisfying the boundary condition when constructing a loss function in PINN. This study provides a pathway for meshfree analysis to be conducted on the exact geometry without domain discretization.
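
To make the ansatz in the abstract concrete, here is a minimal one-dimensional sketch. It is an illustration under stated assumptions rather than the paper's construction: `phi` is one simple function that vanishes at both endpoints of an interval, `lift` is a plain linear interpolant of the boundary data standing in for transfinite interpolation, and `toy_network` is a fixed smooth function standing in for the neural network output. All names and sample values are hypothetical.

```python
import math

def phi(x, a=0.0, b=1.0):
    """Approximate distance function on [a, b]: zero at both endpoints, positive inside."""
    return (x - a) * (b - x) / (b - a)

def lift(x, ua, ub, a=0.0, b=1.0):
    """Linear interpolant g(x) matching the Dirichlet data u(a) = ua and u(b) = ub."""
    return ua + (ub - ua) * (x - a) / (b - a)

def toy_network(x):
    """Placeholder for the network output N(x); any smooth function works here."""
    return math.tanh(3.0 * x - 1.0) + 0.5

def trial(x, ua=2.0, ub=-1.0):
    """Trial function u(x) = g(x) + phi(x) * N(x); the boundary data hold by construction."""
    return lift(x, ua, ub) + phi(x) * toy_network(x)

left_value = trial(0.0)
right_value = trial(1.0)
```

Whatever the network parameters are, the boundary values are reproduced exactly, so the loss only needs the PDE residual at interior collocation points.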
+
+Keywords: deep learning, meshfree method, distance function, R-function, transfinite interpolation, exact geometry 
+
+> ∗Corresponding authors
+> Email address: nsukumar@ucdavis.edu (N. Sukumar)
+> To Appear in Computer Methods in Applied Mechanics and Engineering November 9, 2021
+> arXiv:2104.08426v2 [math.NA] 7 Nov 2021
+
+1. Introduction
+
+Machine learning algorithms based on supervised learning (deep neural networks) are relatively mature in fields such as computer vision, image processing, and speech recognition. Some of the earliest studies on physics-informed neural networks (PINN) to solve boundary-value problems can be traced to the contributions of Lagaris et al. [1–3], and those of McFall [4] and McFall and Mahan [5]. These studies have provided the impetus for the recent interest and advancement of the approach. Over the past 3–4 years there have been many new developments in a meshfree approach that is based on PINN to solve low- and high-order partial differential equations (PDEs). Some of the main contributions in this thread are based on: collocation [6–8], variational principle (deep Ritz) [9, 10], and Petrov-Galerkin domain-decomposition [11, 12]. Lu et al. [13] present an overview of PINN for the solution of PDEs. These developments have been made possible due to the advances in automatic differentiation tools that efficiently compute the derivatives of nonlinear composite functions, and in stochastic gradient descent algorithms that can deliver accurate solutions for nonlinear, nonconvex optimization problems. Furthermore, the availability of public-domain data analysis packages such as TensorFlow [14] and PyTorch [15] has also decreased the barriers to entry for newcomers to this field.
+
+Within the purview of solving PDEs with neural networks, it is instructive to highlight that unlike well-known computational methods (finite elements and its generalizations, finite volume, boundary elements, meshfree, and others), where function approximation is based on linear combinations of basis functions, which have to be chosen and defined a priori, function approximation in deep neural networks is through nonlinear function composition of an activation function $\sigma : \mathbb{R}^d \to \mathbb{R}$, which yields the best approximation function via the solution procedure. A popular choice for $\sigma$ is $\sigma(x) := \mathrm{ReLU}(x) = \max(0, x)$, which is known as the Rectified Linear Unit (ReLU) activation function. The collocation method using PINN is based on minimizing the least squares residual error (nonlinear and nonconvex mean squared error loss function) in order to satisfy the PDE and the boundary conditions at collocation points; the solution of this minimization problem yields the parameters that define the approximation function in the artificial neural network.
+
+In PINN-based deep collocation [6] and deep Ritz [9] methods, the inexact imposition of boundary conditions adversely affects the training of the neural network as well as the accuracy of the method [16–18]. On complex geometries, this shortcoming becomes more acute. This issue has plagued meshfree Galerkin methods [19, 20] since their inception, and a foolproof solution is still unavailable for arbitrary two- and three-dimensional geometries.
+Meshfree basis functions are also nonpolynomial, and hence an additional hurdle is a suitable cell-based numerical integration scheme that is consistent (patch test is passed) and stable (absence of spurious modes) for nonlinear simulations. Since solving for the approximating function is part of the minimization procedure in PINN, a simple (e.g., Monte Carlo) integration scheme for the entire domain can be used. Hence, if a reliable approach to impose boundary conditions on complex geometries in PINN is realized, it can lead to an accurate and robust meshfree method.
+
+In this paper, we tackle this problem by proposing a new approach in PINN to exactly impose boundary conditions, with an eye towards enabling meshfree simulations over complex geometries in $\mathbb{R}^d$ ($d = 2, 3$). Our main contributions are as follows:
+
+1. We introduce a new geometry-aware method in physics-informed neural networks that uses R-functions and transfinite interpolation to exactly impose boundary conditions over complex (affine, curved and multiply-connected) geometries. This geometry-aware approach, which is based on the construction of approximate distance functions (ADFs) to boundary sets, was first proposed by Kantorovich to satisfy Dirichlet boundary conditions [21], and has been extended using the theory of R-functions to exactly impose Dirichlet, Neumann and Robin boundary conditions in a meshfree method, initially by Rvachev [22], and subsequently by Rvachev and coworkers [23–25] and Shapiro and coworkers [26–31]. A related meshfree approach in the spirit of Kantorovich’s method is that of Höllig et al. [32], who used web-splines that are multiplied by a weight function (approximates the distance function) to solve a Dirichlet problem. We also draw from a previous study [33], where R-functions are used to define smooth approximate distance fields over polygonal domains.
+2. Approximate distance fields that stem from mean value potentials [34–36] are also used in PINN to solve PDEs.
+3. On exactly satisfying boundary conditions in physics-informed neural networks, the training of the network is simplified, and this facilitates convergence and improved accuracy of the PINN approximation.
+4. New application of R-functions in $\mathbb{R}^4$: solution of the Poisson equation over the four-dimensional hypercube.
+5. Solving PDEs on curved domains without discretizing the domain (mesh generation) is realized, which provides a pathway to conducting meshfree simulations on the exact geometry (isogeometric analysis) [37].
+
+First, we present a few connections of prior work on finite elements and partition-of-unity meshfree methods to better understand and place the PINN approximation and its use to solve PDEs. Contributions in r-adaptivity with finite elements have a long history in computational mechanics. For instance, in finite-deformation simulations using finite elements, the optimal nodal locations and the solution coefficients have both been simultaneously treated as unknowns in the minimization of the potential energy functional [38]. Since the PINN approximation that is composed by the ReLU activation function can exactly represent piecewise affine functions (Delaunay basis functions) [39], one can view the ReLU network solution as a variational r-adaptive finite element solution procedure. Instead of refining elements in h-adaptive finite elements, adaptive solutions can be realized via a basis refinement strategy that has advantages (for example, “hanging nodes” are a nonissue), which was put forth by Grinspun [40], and a similar approximation refinement perspective can be associated with a multilayer neural network [41]. The connections between ReLU networks and hp-finite elements are studied in Opschoor et al. [42].
+Initiated by Kansa [43, 44], meshfree collocation schemes with positive-definite radial basis functions (RBFs) such as the Gaussian and Hardy’s inverse multiquadrics have been used to solve PDEs [45, 46]. Schaback and Wendland [47] discuss the ties of kernel methods (for example, radial basis functions) to machine learning and meshfree methods. The choice on how to set the “shape parameter” in the Gaussian RBF (controls the support-width of the Gaussian) is an unsettled issue since it is problem dependent and must be carefully selected for boundary and interior points if exponential convergence is to be maintained without exacerbating the condition number. A related approach is the local maximum-entropy (max-ent) meshfree method [48] that yields compactly-supported basis functions of exponential form that constitute a partition-of-unity [49], possess linear completeness, and provide a smooth transition from Delaunay basis functions [50] to global maximum-entropy basis functions [51]. Consider a nodal set with nodes (centers) that are located at $\{\mathbf{x}_i\}_{i=1}^n$. When viewed through the lens of Gaussian weight functions [52–54], a single parameter $\{\beta_i\}_{i=1}^n$ controls the support-width of each Gaussian weight function. Rosolen et al. [55] proposed a variational adaptivity formulation to find optimal values of $\{\beta_i\}_{i=1}^n$ that minimize the potential energy functional for the Poisson equation and nonlinear elasticity. Since RBFs can be represented in a neural network with a single hidden layer [56, 57], the neural network solution optimizes the location of the centers $\{\mathbf{x}_i\}_{i=1}^n$ as well as the support-widths $\{\beta_i\}_{i=1}^n$ [56]. RBF-based partition-of-unity networks [58] for hp-approximation have been introduced, and numerical experiments have been conducted using sparse Gaussian networks to solve PDEs [59].
+Lastly, we mention the recent work of Greco and Arroyo [60], who presented a collocation scheme for PDEs that is based on high-order max-ent approximants, which delivered accurate simulation results on domains with affine and curved boundaries. In Section 9.1.7, we present a one-dimensional example using Gaussian neural networks.
+
+Since PINN affords significant flexibility vis-à-vis existing meshfree (basis-set) methods, solving PDEs over complex geometries using collocation and Ritz methods with artificial neural networks holds significant promise. In a strong collocation PINN method, the loss function consists of the residual errors from the interior of the domain, which is known as interior or PDE loss, and from the boundaries of the domain, which is referred to as boundary (conditions) loss [6]. There are three distinct contributions in the mean squared error loss function: (1) residual error at interior collocation points where the PDE must be satisfied; (2) residual error at boundary collocation points where the essential (Dirichlet) boundary condition must be satisfied; and (3) residual error at boundary collocation points where the Robin or natural (Neumann) boundary condition must be satisfied. Early approaches [1, 3–5] had already recognized the importance of exact imposition of boundary conditions in artificial neural networks. Lagaris et al. [1] considered two terms in the trial function, with the first term being an analytical function that exactly imposed the boundary conditions and the second term chosen as the product of the PINN approximation and a function that vanished on the boundary; for irregular boundaries, Lagaris et al. [3] used an RBF network in the first term to satisfy the boundary conditions at a collection of discrete points on the boundary. McFall et al. [4, 5] and more recently Sheng and Yang [61] introduced a length factor (measure of the distance to the boundary) associated with the boundary to impose boundary conditions, and Berg and Nyström [7] approximated the distance function using a low-capacity neural network to impose boundary conditions over complex geometries. In many recent studies [16–18, 62], the implications of imposing essential boundary conditions via the loss function in PINN have been studied, and numerical experiments have affirmed that the presence of the boundary residual terms compromises the convergence of the stochastic gradient descent algorithm and the accuracy of the method.
+To address this problem, remedies have been introduced in the PINN literature, such as using two neural networks, one for the PDE and the other to satisfy the essential boundary condition [3, 7, 61–63], introduction of a penalty parameter via an augmented variational formulation to weakly impose the essential boundary conditions [9], and Nitsche’s method to impose the essential boundary condition [64]. Some of these approaches mirror those previously pursued in meshfree and particle methods to satisfy essential boundary conditions [19, 20, 65]. In meshfree Galerkin methods, the choice of the space for Lagrange multipliers is delicate; the penalty approach leads to a saddle-point problem and the inf-sup condition must be met; and though Nitsche’s method is variationally consistent, the stabilization parameter in it must be judiciously chosen. For low-dimensional problems over complex geometries, an accurate and robust meshfree approach remains elusive. Since from the universal approximation theorem [66, 67] we know that a neural network with one hidden layer can represent any $L^2$ function to arbitrary accuracy, it stands to reason to choose an ansatz that satisfies the boundary conditions a priori, so that the loss function is expressed solely in terms of the residual error at only the interior collocation points where the PDE is required to be satisfied. If the essential boundary conditions are exactly met, then this precludes “variational crimes” in a Ritz method [68]. Lastly, and most importantly, in deep collocation [6], multiple terms (interior loss and boundary losses) that have to be individually minimized are incorporated within a single objective (loss) function. When this loss function is minimized, the solution that is realized depends on the weight (equal weights is the unbiased choice) that is assigned to each objective function, which reflects the importance of each residual error contribution. Rohrhofer et al. [69] discuss network training in relation to the Pareto front that appears in multiobjective constrained optimization. In the NVIDIA SimNet™ toolkit [70], signed distance function weighting is used to dynamically assign the spatially varying weight functions for the PDE and boundary loss terms. Since these weights are problem dependent, they should not be fixed a priori, since then the magnitude of the training loss by itself is a misleading error measure. To establish the accuracy of the method, the error in $u$ as well as in $u'$ must be assessed. We present numerical results in Sections 9–11 that support this thesis and point to the merits of the new approach.
+
+In this paper, we solve this problem of competing loss terms in PINN formulations by constructing a trial function for the neural network that a priori satisfies all boundary conditions in deep collocation, and meets kinematic admissibility when used in a deep Ritz method. This eliminates the boundary terms in the loss function. Our approach is based on constructing distance functions (exact or approximate) to the boundary of the domain, and it can treat essential (Dirichlet) as well as mixed (Dirichlet and Robin) boundary conditions over complex domains. We use the exact distance function whenever it is available and applicable. However, in general, we construct approximate distance functions using two different techniques: the theory of R-functions [23, 30] and the theory of mean value potential fields [34–36]. These methods provide approximate distance functions that possess the desirable property of being zero on the boundary of the domain with unit (inward) normal directional derivative. In addition, they are smooth in the interior of the domain, a property that the exact distance function does not always possess. Functions whose sign depends solely on the signs of their arguments encode Boolean logic, and are known as R-functions.
+R-functions provide an implicit function representation for line segments, curves, and solid regions, and are composed by Boolean operations (negation, conjunction, disjunction, equivalence). Mean value potential fields are specific forms of a singular double-layer potential that yield $L_p$-distance fields [36]. For a domain in $\mathbb{R}^2$, this singular potential is defined as the integral of the reciprocal of the $p$-th power of the distance from its boundary. For a polygon with $p = 1$, closed-form expressions for the ADF are available [34], but for closed curves in $\mathbb{R}^2$, numerical integration is required to compute the ADF. Once the approximate distance functions are formed, methods to impose essential and Robin boundary conditions are available that rely on transfinite interpolation [23, 25]. R-functions with approximants such as B-splines [27, 31] and RBFs [71] have been used in a meshfree Galerkin method to solve boundary-value problems.
+
+The remainder of this paper is organized as follows. In Section 2, we discuss the properties of distance functions, and in Section 3, the essentials on R-functions and the construction of approximate distance functions are described. In particular, joining R-functions using R-equivalence composition is presented, which is used in this paper. The inverse of the normalizing weight function that appears in the expression for mean value coordinates (polygon) and transfinite mean value interpolant (closed curves) is an approximate distance field. These are particular instances ($p = 1$) of $L_p$-distance fields, and are discussed in Section 4. On using normalizing functions and solution structures in the R-function method [22, 23], we describe in Section 5 the use of ADFs to construct an ansatz in PINN that exactly satisfies boundary conditions for second- and fourth-order problems. The construction of the trial function in a deep neural network is presented in Section 6, along with a summary of the feedforward neural network and the backpropagation (computation of the gradient of the loss function and use of stochastic gradient descent) algorithm. The loss functions for collocation and deep Ritz formulations are presented in Section 7. The numerical implementation is discussed in Section 8, where we also provide code snippets of some of the main functions. One- and two-dimensional numerical simulations are presented in Sections 9 and 10, where we apply this new approach to a broad suite of boundary-value problems (Poisson, harmonic coordinates, plate bending, Eikonal equation) on convex and nonconvex domains with affine and curved boundaries. In addition, the Poisson equation over the four-dimensional hypercube is solved in Section 11. These numerical results clearly demonstrate the benefits of exactly imposing the boundary conditions in PINN: it simplifies the training of the network and enhances the accuracy and robustness of the method. Finally, we conclude with Section 12, where we summarize the main developments in this paper and discuss some of the topics of future research.
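
The ansatz motivated in this introduction (an analytical term that carries the boundary data plus a boundary-vanishing function times the network output) can be sketched in one dimension. Everything specific below is an assumption made for illustration and is not taken from the paper: the interval [0, 1], the polynomial ADF `x * (1 - x)`, the linear lift `boundary_lift`, and the tiny random NumPy network `mlp` standing in for the PINN approximation.

```python
import numpy as np

def mlp(x, params):
    """Tiny tanh network N(x; theta), a stand-in for the PINN approximation."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(np.outer(x, w1) + b1)   # shape (n_points, n_hidden)
    return hidden @ w2 + b2                  # shape (n_points,)

def adf_interval(x):
    """Assumed ADF to the boundary {0, 1}: zero at both ends, positive inside."""
    return x * (1.0 - x)

def boundary_lift(x, u0, u1):
    """Linear function g with g(0) = u0 and g(1) = u1 (carries the Dirichlet data)."""
    return u0 * (1.0 - x) + u1 * x

def trial(x, params, u0, u1):
    """Trial function u = g + phi * N; the boundary values hold for any params."""
    return boundary_lift(x, u0, u1) + adf_interval(x) * mlp(x, params)

rng = np.random.default_rng(0)
n_hidden = 8
params = (rng.normal(size=n_hidden), rng.normal(size=n_hidden),
          rng.normal(size=n_hidden), rng.normal())

x_boundary = np.array([0.0, 1.0])
u_boundary = trial(x_boundary, params, u0=2.0, u1=-1.0)
```

Because `adf_interval` vanishes at both end points, `u_boundary` equals the prescribed data for any draw of `params`, so in this sketch a training loss would only need the interior PDE residual.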
+
+2. Distance Functions and their Properties 
+
+The signed distance function is an implicit representation for curves and surfaces, and also provides fast evaluation of predicates for geometric objects. Let $S \subset \mathbb{R}^d$ denote a domain (open, bounded set) with boundary $\partial S$. The exact distance function $d(\mathbf{x})$ gives the shortest distance from any point $\mathbf{x} \in \mathbb{R}^d$ to $\partial S$. It is clear that $d(\mathbf{x})$ is identically zero on $\partial S$. Computing the exact distance function requires solving the Eikonal equation (see Section 10.5), which is computationally expensive. Therefore, it is desirable to construct an approximate distance function or ADF (formally represented by $\phi(\mathbf{x})$) that has a closed-form (non-iterative algorithm) expression. Furthermore, since the exact distance function may only be continuous and not continuously differentiable, it may not be suitable for use in a trial function to solve PDEs. Since our objective is to use ADFs in a collocation or Ritz method to solve boundary-value problems, their differential properties are important. If essential boundary conditions are imposed on the entire boundary $\partial S$, then the ADF must be zero on $\partial S$, positive in $S$, and its gradient must not vanish for any $\mathbf{x} \in \partial S$. In addition, since the exact distance function has derivative discontinuities on the medial axis of the domain, smooth approximations of the distance function must be used within the trial function for a collocation method with PINN. For a second-order problem, a $C^0$ distance function that has gradient discontinuities in the interior of the domain cannot be used in the collocation approach since the Laplacian of the distance function will be unbounded at a collocation point. These considerations are crucial when used to solve PDEs. For instance, positivity in $S$ precludes the presence of singularities within the domain, which in general is difficult to construct as noted in McFall [4].
+
+For clarity of exposition in this paper, we use $\boldsymbol{\nu} := \boldsymbol{\nu}(\mathbf{x})$ to denote the unit inward normal vector (appears in the theory related to R-functions) on the boundary $\partial S$, and $\mathbf{n} := \mathbf{n}(\mathbf{x})$ as the unit outward normal vector (used when defining Neumann or Robin boundary condition) on $\partial S$. It is noted that $\mathbf{n} = -\boldsymbol{\nu}$. If $\partial S$ in $\mathbb{R}^2$ is composed of piecewise line segments and curves, then we use $\phi_i := \phi_i(\mathbf{x})$ to denote the ADF to each curve or line segment. For a point $\mathbf{x} \in \mathbb{R}^d$ on $\partial S$, it is essential that any approximation to the distance function satisfy $\phi = 0$. Furthermore, to mimic the exact distance function, the normal derivative with respect to $\boldsymbol{\nu}$ on the boundary should be unity, $\partial d / \partial \nu = \nabla d \cdot \boldsymbol{\nu} = 1$, and it is desirable that all higher order normal derivatives vanish. An $m$-th order approximate distance function requires that the second- to $m$-th order normal derivatives vanish on all regular points (unit normal is well-defined) on $\partial S$ [23]:
+
+$$\frac{\partial \phi}{\partial \nu} = 1, \qquad \frac{\partial^k \phi}{\partial \nu^k} = 0 \quad (k = 2, 3, \ldots, m), \tag{1}$$
+
+and such a function is said to be normalized to the $m$-th order. For finite $m$, the normalized function matches the exact distance function only in the vicinity of the boundary; for points that are away from the boundary, it deviates from the exact distance. Apart from applications in solid modeling, mesh generation, real-time rendering and computer vision, where distance functions are used, normalized first-order distance functions are also a suitable choice for the initialization and assignment of the extension velocity at points away from the interface in the level set method [72]. As noted in Biswas and Shapiro [29], use of normalized distance functions mitigates the bulging phenomenon in the vicinity of where the segments or curves are joined [73], since undulations (presence of local extrema) are undesirable in the representation of the surface.
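
As a small worked example of condition (1), consider the one-dimensional domain $S = (0, 1)$ with the polynomial $\phi(x) = x(1 - x)$. This choice is an illustrative assumption and does not appear in the source. The inward normal is $\nu = +1$ at $x = 0$ and $\nu = -1$ at $x = 1$, so

$$
\phi(0) = \phi(1) = 0, \qquad
\left.\frac{\partial \phi}{\partial \nu}\right|_{x=0} = \phi'(0) = 1, \qquad
\left.\frac{\partial \phi}{\partial \nu}\right|_{x=1} = -\phi'(1) = 1, \qquad
\frac{\partial^2 \phi}{\partial \nu^2} = \phi''(x) = -2 \neq 0 .
$$

Under these assumptions $\phi$ is normalized to first order only ($m = 1$): it agrees with the exact distance $\min(x, 1 - x)$ to leading order near each end point and deviates in the interior, where $\phi(1/2) = 1/4$ while the exact distance is $1/2$.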
+
+3. R-functions and Approximate Distance Functions 
+
+The theory of R-functions can be used to construct a composite approximate distance function, $\phi(\mathbf{x})$, to any arbitrarily complex boundary $\partial S$, when approximate distance functions, $\phi_i(\mathbf{x})$, to the partitions of $\partial S$ are known. Consider a real-valued function $F(\omega_1, \omega_2, \ldots, \omega_q)$, where $\omega_i(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$ ($i = 1, \ldots, q$) are also real-valued functions. If the sign of $F(\cdot)$ is solely determined by the signs of its arguments $\omega_i(\mathbf{x})$, then $F(\cdot)$ is known as an R-function [26, 30]. R-functions were proposed by V. L. Rvachev in 1963 [26]. For example, $F_1(x, y) = 1 + x^2 + y^2$ and $F_2(x, y) = xy$ are R-functions in $\mathbb{R}^2$, whereas $F_3(x, y) = \sqrt{x^2 + y^2} - 1$ and $F_4(x, y) = \sin xy$ are not. The important properties of R-functions are provided in Rvachev and Sheiko [23] and Shapiro [26]. On combining set-theoretic Boolean operations with such functions, the inverse problems of semi-analytic geometry (solid modeling) can be solved.
+
+Consider a continuous function $\omega_i : \mathbb{R}^d \to \mathbb{R}$. Let $\Omega \subset \mathbb{R}^d$ be an open, bounded domain, $\bar{\Omega} = \Omega \cup \partial\Omega$ be the closure of $\Omega$, and define $\Omega^c$ to be the complement of $\Omega$ ($\Omega \cup \Omega^c = \mathbb{R}^d$). If $\omega_i$ is strictly positive in $\Omega$, identically equal to zero on $\partial\Omega$, and strictly negative in $\Omega^c \setminus \partial\Omega$, then it is evident that $F(\omega_i) = \omega_i$ is an R-function. Over the region $\bar{\Omega}$, we associate $\omega_i$ with the Boolean 1 (logical true) and over the region $\Omega^c$ we associate it with the Boolean 0 (logical false). Note that $\omega_i = 0$ (0 is assumed to be signed) is included in both sets so that it can be assigned to either the set of negative real values or the set of positive real values [30]. Hence, similar to Boolean functions, $\omega_i$ is closed under composition. Furthermore, just as Boolean functions are written using the symbols $\neg$, $\vee$, and $\wedge$, which correspond to complement, union and intersection in set theory, every R-function can be written as the composition of the corresponding elementary R-functions: R-negation ($-\omega$), R-disjunction ($\omega_1 \vee \omega_2$), and R-conjunction ($\omega_1 \wedge \omega_2$). On defining R-functions for regions in $\mathbb{R}^d$, a solid can then be composed using the set-theoretic operations of $\neg$, $\vee$, and $\wedge$. For the universal set $U = \mathbb{R}^2$, Venn diagrams for some of the operations in set theory are shown in Fig. 1, and the corresponding operations using R-functions are indicated. The simplest examples of R-functions are the R-disjunction (union) and the R-conjunction (intersection) functions. These are
+
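+$$\omega_1 \vee \omega_2 = \frac{\omega_1 + \omega_2}{2} + \frac{\sqrt{(\omega_1 - \omega_2)^2}}{2} = \max(\omega_1, \omega_2), \qquad \omega_1 \wedge \omega_2 = \frac{\omega_1 + \omega_2}{2} - \frac{\sqrt{(\omega_1 - \omega_2)^2}}{2} = \min(\omega_1, \omega_2).$$
+
+Figure 1: Venn diagram for union, intersection, complement and equivalence in $\mathbb{R}^2$. $\omega_A$ and $\omega_B$ are R-functions that are positive in $\Omega_A$ and $\Omega_B$ (open sets), respectively. (a) $A \cup B \equiv \omega_A \vee \omega_B$; (b) $A \cap B \equiv \omega_A \wedge \omega_B$; (c) $\bar{A} \equiv -\omega_A$; and (d) $(A \cap B) \cup (\bar{A} \cap \bar{B}) \equiv \omega_A \sim \omega_B$. Examples of $\vee$ and $\wedge$ operations using R-functions are given in (2) and (3).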
+
+The generalization of the above R-functions is [30]: 
+
+$$R_\alpha(\omega_1, \omega_2) := \frac{1}{1 + \alpha} \left( \omega_1 + \omega_2 \pm \sqrt{\omega_1^2 + \omega_2^2 - 2\alpha\,\omega_1 \omega_2} \right), \tag{2}$$
+
+with $(+)$ and $(-)$ signs defining R-disjunction and R-conjunction, respectively. If $\omega_1$ and $\omega_2$ denote the sides of a triangle, then the triangle inequality is expressed in (2) with $-1 < \alpha < 1$ being the cosine of the angle between the two sides. For $\alpha = 1$, the max and min R-functions are recovered. If $\omega_1$ and $\omega_2$ are positive, then so are $\omega_1 \vee \omega_2$ and $\omega_1 \wedge \omega_2$. The R-functions defined in (2) are not analytic at points where $\omega_1 = \omega_2 = 0$. Smoothness can be obtained by defining the function ($\alpha = 0$ is selected) [30]
+
+$$R_s(\omega_1, \omega_2) := \left[ \omega_1 + \omega_2 \pm \sqrt{\omega_1^2 + \omega_2^2} \right] \left( \omega_1^2 + \omega_2^2 \right)^{s/2}, \tag{3}$$
+
+which renders these functions to be $C^s$-continuous at all points other than where $\omega_1 = \omega_2 = 0$.
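
The R-disjunction and R-conjunction in (2) translate directly into array code. The NumPy sketch below is illustrative only: the function names, the default `alpha=0.0`, and the round-off guard inside the square root are assumptions made here and are not part of the source.

```python
import numpy as np

def _root(w1, w2, alpha):
    """Square-root term of Eq. (2); clipped at zero to guard against round-off."""
    return np.sqrt(np.maximum(w1**2 + w2**2 - 2.0 * alpha * w1 * w2, 0.0))

def r_disjunction(w1, w2, alpha=0.0):
    """R-disjunction (union): Eq. (2) with the (+) sign."""
    return (w1 + w2 + _root(w1, w2, alpha)) / (1.0 + alpha)

def r_conjunction(w1, w2, alpha=0.0):
    """R-conjunction (intersection): Eq. (2) with the (-) sign."""
    return (w1 + w2 - _root(w1, w2, alpha)) / (1.0 + alpha)

# Random nonzero arguments standing in for omega_1(x) and omega_2(x).
rng = np.random.default_rng(0)
w1 = rng.normal(size=1000)
w2 = rng.normal(size=1000)

union = r_disjunction(w1, w2)          # positive where w1 > 0 or  w2 > 0
intersection = r_conjunction(w1, w2)   # positive where w1 > 0 and w2 > 0
```

With `alpha=1.0` the two functions reduce to `np.maximum` and `np.minimum`, as stated after (2). With `alpha=0.0` the sign of each output follows the corresponding Boolean combination of the input signs, which is the defining property of an R-function.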
+
+3.1. Normalized functions for line segments and curves 
+
+Shapiro and Tsukanov [74] describe the representation of line segments and curves using R-functions and discuss their differential properties. Let us consider one line segment that joins $\mathbf{x}_1 := (x_1, y_1)$ and $\mathbf{x}_2 := (x_2, y_2)$. The center of this segment is denoted by $\mathbf{x}_c := (\mathbf{x}_1 + \mathbf{x}_2)/2$, and the length of the segment is $L = \|\mathbf{x}_2 - \mathbf{x}_1\|$. Now, we define [25]
+
+$$f := f(\mathbf{x}) = \frac{(x - x_1)(y_2 - y_1) - (y - y_1)(x_2 - x_1)}{L}, \tag{4}$$
+
+which is the signed distance function from point $\mathbf{x}$ to the line that passes through $\mathbf{x}_1$ and $\mathbf{x}_2$. Since the representation of the segment can be viewed as the intersection of an infinite line with a disk of radius $L/2$, we consider the following trimming function that is normalized to first order [25]:
+
+$$t := t(\mathbf{x}) = \frac{1}{L} \left[ \left( \frac{L}{2} \right)^2 - \|\mathbf{x} - \mathbf{x}_c\|^2 \right], \tag{5}$$
+
+where $t \geq 0$ defines a disk with center at $\mathbf{x}_c$. Now, with $f(\mathbf{x})$ and $t(\mathbf{x})$ on hand, we define a normalized function (up to first order) $\phi(\mathbf{x})$ that is $C^2$ at all points away from the line segment [25, 29]:
+
+$$\phi := \phi(\mathbf{x}) = \sqrt{f^2 + \left( \frac{\varphi - t}{2} \right)^2}, \qquad \varphi = \sqrt{t^2 + f^4}, \tag{6}$$
+
+which is an approximation of the distance function to the segment with end points $\mathbf{x}_1$ and $\mathbf{x}_2$. The function $\phi$ in (6) is a modification of the form $\varphi = |t|$ [74], which has a derivative discontinuity at $t = 0$. Figure 2 provides a graphical illustration of $f$, $t$ and $\phi$ for a line segment. For a quarter-circular arc, the functions $f$, $t$ and $\phi$ are shown in Fig. 3. In Fig. 4, the approximate distance function (normalized to order 1) to a circle and an ellipse are presented. The ADF to a circle of radius $R$ and center located at $\mathbf{x}_c := (x_c, y_c)$ is given by
+
+$$\phi(\mathbf{x}) = \frac{R^2 - (\mathbf{x} - \mathbf{x}_c) \cdot (\mathbf{x} - \mathbf{x}_c)}{2R}, \tag{7}$$
+
+where $\phi(\mathbf{x})$ is a smooth (bivariate polynomial of degree 2) function. For an elliptical disk whose closure (interior and boundary) is given by the R-function $\omega(\mathbf{x}) \geq 0$, we construct an ADF that is normalized to order 1 using [23]
+
+$$\phi(\mathbf{x}) = \frac{\omega(\mathbf{x})}{\sqrt{\omega^2(\mathbf{x}) + \|\nabla \omega(\mathbf{x})\|^2}}. \tag{8}$$
+
+Figure 2: Construction of the approximate distance function to a line segment. The leftmost plot depicts the signed distance function (4) to a line in $\mathbb{R}^2$; the middle plot shows the trimming function (5); and the rightmost plot displays the approximate distance function (6) to a line segment.
+
+Figure 3: Construction of the approximate distance function to a quarter-circular arc. The leftmost plot depicts $f$, the equation of the circular arc (normalized to first order); the trimming function is shown in the middle plot; and the rightmost plot displays the approximate distance function given by (6).
+
+Figure 4: Approximate distance function (normalized to order 1) to (a) a circle and (b) an ellipse.
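
A minimal NumPy sketch of Eqs. (4) to (7) follows. It assumes points are passed as separate `x` and `y` values (scalars or arrays) and segment end points as `(x, y)` tuples; the function names `segment_adf` and `circle_adf` are hypothetical and are not taken from the paper's code snippets.

```python
import numpy as np

def segment_adf(x, y, p1, p2):
    """ADF to the segment p1-p2, built from Eqs. (4)-(6)."""
    (x1, y1), (x2, y2) = p1, p2
    length = np.hypot(x2 - x1, y2 - y1)
    xc, yc = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # Eq. (4): signed distance to the infinite line through p1 and p2.
    f = ((x - x1) * (y2 - y1) - (y - y1) * (x2 - x1)) / length
    # Eq. (5): trimming function, nonnegative inside the disk of radius L/2.
    t = ((0.5 * length) ** 2 - (x - xc) ** 2 - (y - yc) ** 2) / length
    # Eq. (6): smooth combination of f and t.
    varphi = np.sqrt(t**2 + f**4)
    return np.sqrt(f**2 + (0.5 * (varphi - t)) ** 2)

def circle_adf(x, y, center, radius):
    """ADF to a circle, Eq. (7): positive inside, zero on the circle."""
    xc, yc = center
    return (radius**2 - (x - xc) ** 2 - (y - yc) ** 2) / (2.0 * radius)

p1, p2 = (-1.0, 0.0), (1.0, 0.0)
xs = np.linspace(-1.0, 1.0, 11)
on_segment = segment_adf(xs, np.zeros_like(xs), p1, p2)   # points on the segment

h = 1e-6
normal_slope = segment_adf(0.3, h, p1, p2) / h   # finite-difference normal derivative
```

In this sketch `on_segment` vanishes (up to round-off) and `normal_slope` is close to one, which is the first-order normalization in (1). Points on the line but beyond the end points, such as `(2.0, 0.0)`, get a positive value because the trimming function is negative there.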
+
+In general, for curves that are given in parametric form, such as Bézier and non-uniform rational B-spline (NURBS) curves, constructing ADFs requires implicitization of the curve. These extensions are discussed and presented in Upreti et al. [75]. For some of the considerations and challenges in the representation (implicit and parametric) of curves in enriched computational methods, see Chin and Sukumar [76].
+
+3.2. R-equivalence operation
+
+Given the normalized distance functions $\phi_1$ and $\phi_2$ for two curves $c_1$ and $c_2$, a distance field $\phi(\phi_1, \phi_2)$ for the union $c_1 \cup c_2$ must be zero when either $\phi_1 = 0$ or $\phi_2 = 0$ and positive otherwise. When $c_1$ and $c_2$ are line segments, the naive formula $\phi(\phi_1, \phi_2) = \phi_1 \phi_2$ is no longer normalized at the regular points of the segments. An R-equivalence solution that preserves normalization up to order $m$ of the distance function at all regular points (nonvertices for polygonal curves) is given by [29]:
+
+$$\phi(\phi_1, \phi_2) := \phi_1 \sim \phi_2 = \frac{\phi_1 \phi_2}{\sqrt[m]{\phi_1^m + \phi_2^m}} = \frac{1}{\sqrt[m]{\dfrac{1}{\phi_1^m} + \dfrac{1}{\phi_2^m}}}, \tag{9}$$
+
+where $\lim_{m \to \infty} \phi(\phi_1, \phi_2) = \min(\phi_1, \phi_2)$. When $\partial S$ (closed curve) is composed of $n$ pieces, then a $\phi$ that is normalized up to order $m$ is given by (see the proof in [33]):
+
+$$\phi(\phi_1, \ldots, \phi_n) := \phi_1 \sim \phi_2 \sim \cdots \sim \phi_n = \frac{1}{\sqrt[m]{\dfrac{1}{\phi_1^m} + \dfrac{1}{\phi_2^m} + \cdots + \dfrac{1}{\phi_n^m}}}. \tag{10}$$
+
+The $\phi$ that is formed in (10) can be viewed as the reciprocal of the $L_m$-norm of inverse distance measures, which bears similarity to $L_m$-distance fields [36]. An alternative joining procedure is to consider the R-conjunction given by
+
+$$\phi_s(\phi_1, \phi_2) := \phi_1 \wedge \phi_2 = \phi_1 + \phi_2 - \sqrt[s]{\phi_1^s + \phi_2^s}, \tag{11}$$
+
+which is a function that is normalized to the $(s - 1)$-th order [29]. However, the joining operation is not associative, which makes this choice less desirable. The R-equivalence joining relation in (10) is associative. The approximate distance function to two line segments is shown in Fig. 5, where the R-conjunction composition in (11) with $s = 2, 3$ and the R-equivalence relation in (9) with $m = 1, 2$ are compared. The bulging phenomenon [73] is noticeable in the vicinity of the joining point. In Fig. 6, we present the approximate distance function to a curved triangular region using R-equivalence for different orders of the normalizing parameter $m$. As $m$ increases, the ADF approaches the exact distance, which is observed when inspecting the contours in the interior of the curved triangle. The ADFs for a triangle, square, hexagon, and an L-shaped polygon are shown in Fig. 7. The ADF for a square in Fig. 7b bears similarities to a superellipse, $|x|^m + |y|^m = 1$, which has rounded corners as $m$ increases. As the last example, we present the ADF for a complex polygonal domain. We consider the polygonalized map of Bhutan,¹ which has 291 boundary vertices. The contour plot for $\phi(\mathbf{x})$ is depicted in Fig. 8, and we observe that the contours are smooth in the interior and well-separated ($\phi$ is monotonic in $\Omega$).
+
+Finally, we mention that a modified form of the R-equivalence relation (10) is also discussed in Biswas and Shapiro [29]. This modified form is constructed with an eye on better capturing the first-order normalization condition in the sector region between two line segments, where the normal is undefined and the closest point to either segment is the common vertex. With increase in $m$, the R-equivalence joining operation provides a better approximation to the exact distance away from the segments and also improved normalization properties in the vicinity of the segments; however, this comes at the expense of the function being higher order and hence its Laplacian will have greater undulations. Use of $m = 2$ has been adopted in prior computational studies [29, 33, 75], but herein, we adopt $m = 1$ in most of the numerical simulations that are presented in Sections 9–11.
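
The R-equivalence joining in (10) can be sketched as follows. As a simplifying assumption (not the paper's construction, which uses the segment ADFs from (6)), the four pieces of the unit-square boundary are represented by the distances to the infinite lines that contain each edge, which is adequate for points inside the square. The names `r_equivalence` and `unit_square_adf` are hypothetical.

```python
import numpy as np

def r_equivalence(phis, m=1):
    """Join ADFs phi_1, ..., phi_n with Eq. (10)."""
    phis = np.asarray(phis, dtype=float)
    with np.errstate(divide="ignore"):
        inverse_powers = 1.0 / phis**m      # inf wherever some phi_i is zero
    # Reciprocal of the m-norm of the inverse distances; 1 / inf gives 0.
    return 1.0 / np.sum(inverse_powers, axis=0) ** (1.0 / m)

def unit_square_adf(x, y, m=1):
    """ADF to the boundary of [0, 1]^2 from four edge distances (assumed form)."""
    edges = np.broadcast_arrays(x, 1.0 - x, y, 1.0 - y)
    return r_equivalence(edges, m=m)

h = 1e-6
on_edge = unit_square_adf(0.0, 0.5)            # a boundary point
normal_slope = unit_square_adf(h, 0.5) / h     # finite-difference inward normal derivative
near_min = unit_square_adf(0.2, 0.5, m=50)     # large m approaches min(phi_i)
```

Here `on_edge` is zero, `normal_slope` is close to one (first-order normalization), and `near_min` is close to the smallest edge distance 0.2, consistent with the limit stated after (9).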
+
+4. Generalized Mean Value Potentials and Lp-Distance Fields 
+
+In addition to the theory of R-functions, another approach to construct approximate distance fields is via the theory of mean value potential fields. This has connections to generalized barycentric coordinates and in particular to mean value interpolation over polygons and curved domains [77]. Generalized barycentric coordinates [78–80] are an extension of barycentric coordinates over simplices to polygons and polyhedra. These coordinates (shape functions) have linear precision and are nonnegative over convex polygons. Transfinite barycentric interpolation over domains bounded by curves is the continuous counterpart of generalized barycentric coordinates over polygons [77]. Given a function u : R2 → R that assumes the function g(x) on the boundary (curved) of a domain, a transfinite interpolant provides an approximation of u(x) that matches g(x) over the curved boundary of the domain. For a domain that    
+
+> 1Vectorized eps image obtained from https://freevectormaps.com
+
+Figure 5: Approximation of the distance function to two line segments. (a) R-conjunction composition with s = 2, 3 in (11), and (b) R-equivalence composition in (9) for the normalization parameter m = 1, 2. The ADFs are normalized to order s − 1 and m, respectively.
+
+Figure 6: Plots of the approximate distance function to a curved triangle using R-equivalence for different choices of the normalizing parameter: (a) m = 2, (b) m = 3, (c) m = 6, and (d) m = 10.
+
+Figure 7: Plots of the approximate distance function using R-equivalence (m = 1) for polygons. (a) triangle, (b) square, (c) regular hexagon, and (d) L-shaped (nonconvex) polygon.
+
+Figure 8: Plot of the approximate distance function using R-equivalence (m = 1) for polygonalized map of Bhutan. The polygon has 291 vertices.
+
+is bounded by affine or curved boundaries, the reciprocal of the mean value normalization function is a smoothed approximation to the exact distance function [35, 81, 82], and is a specific instance (p = 1) of the reciprocal of a singular double-layer Lp-potential field [36]. We refer to the method that generates these smoothed distance functions by the acronym MVP, since they stem from (generalized) mean value potential fields [77]. The construction of φ(x) over polygons and curved domains is presented in Sections 4.1 and 4.2, respectively.
+
+4.1. Approximate distance fields on arbitrary planar polygons 
+
+A popular generalized barycentric coordinate is Floater’s mean value coordinates [34], which is derived using the circumferential mean value theorem for harmonic functions. This conception stemmed from the objective to approximate a harmonic map by a convex combination map (positive weights) over a triangulation, so that injectivity is preserved. Mean value coordinates have many remarkable properties: they are valid on arbitrary planar polygons, including nested polygons; they are C∞ smooth in Ω with derivative discontinuities only at the vertices of the polygon; they reduce to piecewise affine functions on the edges of a polygon; they are nonnegative in the kernel of the polygon; the reciprocal of the normalizing weight function is a smoothed ADF; and they have a smooth extension outside the polygon [81]. Consider the nonconvex polygon (n-gon) shown in Fig. 9, whose n vertices are defined in counterclockwise orientation. The coordinates of the vertices are $\{\mathbf{x}_i\}_{i=1}^{n}$, and x is an arbitrary point in the interior of the polygon. The mean value coordinates, $\{\phi_i(\mathbf{x})\}_{i=1}^{n}$, are defined as [34]:
+
+$$
+\phi_i(\mathbf{x}) = \frac{w_i(\mathbf{x})}{W(\mathbf{x})}, \qquad
+w_i(\mathbf{x}) := \frac{\tan(\alpha_{i-1}/2) + \tan(\alpha_i/2)}{\|\mathbf{x}_i - \mathbf{x}\|}, \qquad
+W(\mathbf{x}) = \sum_{j=1}^{n} w_j(\mathbf{x}), \tag{12}
+$$
+
+where the angles $\alpha_{i-1}$ and $\alpha_i$ are shown in Fig. 9. Let $\mathbf{r}_i := \mathbf{x}_i - \mathbf{x}$, with $r_i = \|\mathbf{x}_i - \mathbf{x}\|$ the Euclidean distance between x and $\mathbf{x}_i$. On noting the half-angle formula for tan(·), we can define
+
+$$
+t_i := \tan\left(\frac{\alpha_i}{2}\right) = \frac{\sin\alpha_i}{1 + \cos\alpha_i} = \frac{r_i r_{i+1} \sin\alpha_i}{r_i r_{i+1} + \mathbf{r}_i \cdot \mathbf{r}_{i+1}} = \frac{\det(\mathbf{r}_i, \mathbf{r}_{i+1})}{r_i r_{i+1} + \mathbf{r}_i \cdot \mathbf{r}_{i+1}}, \qquad
+W(\mathbf{x}) = \sum_{i=1}^{n} \left(\frac{1}{r_i} + \frac{1}{r_{i+1}}\right) t_i \quad (\mathbf{r}_{n+1} := \mathbf{r}_1), \tag{13}
+$$
+
+which is now valid for all points x that are in the interior of a convex or nonconvex polygon. The denominator vanishes when $\alpha_i = \pi$, i.e., when x lies on the boundary of the polygon, but there $\phi_i(\mathbf{x})$ are known. The singularity of the weight function on the boundary is a property shared by nonnegative generalized barycentric coordinates.
+
+Figure 9: Notation used in the definition of (a) mean value coordinates [34] and (b) generalized mean value potentials [77]. In (b), the parameters that are used to form W(x) in (13) are shown.
+
+For a polygon with one interior (nested) m-gon, the vertices of the inner polygon are defined in clockwise orientation [81]. The contributions of $\{w_i(\mathbf{x})\}_{i=1}^{n+m}$ are used to form W(x) in (12). Hormann and Floater [81] showed that φ(x) = 1/W(x) is an ADF to the boundary of the polygon, where its normal derivative is 1/2. On taking
+
+$$
+\varphi(\mathbf{x}) = \frac{2}{W(\mathbf{x})}, \tag{14}
+$$
+
+where the scaling factor is the volume of the unit sphere in $\mathbb{R}^{d-1}$ [82] (equal to 2 when d = 2), the normal derivative becomes ∂φ/∂ν = 1 on ∂Ω. As we discuss in Section 4.2, the W(x) that appears in (13) and (14) is a particular instance (p = 1) of the mean value potential field $W_p(\mathbf{x})$. We now have a smooth ADF for a polygon that is normalized to order 1. In Fig. 10, the surface and contour plots of φ(x) are shown for square and L-shaped domains, as well as nested squares and octagons. The ADF using R-equivalence for the polygonalized map of Bhutan is presented in Fig. 8. In Fig. 11, we show the approximate distance field (surface and contour plots) for the same polygonalized map.
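+
+As a quick sanity check of (13) and (14), consider a worked example that is not part of the paper: the unit square with vertices (0,0), (1,0), (1,1), (0,1) taken counterclockwise, evaluated at its center x = (1/2, 1/2). The choice of polygon and evaluation point is purely illustrative.
+
+$$
+\begin{aligned}
+\mathbf{r}_1 &= (-\tfrac12, -\tfrac12), \quad \mathbf{r}_2 = (\tfrac12, -\tfrac12), \quad r_1 = r_2 = \tfrac{1}{\sqrt{2}}, \\
+t_1 &= \frac{\det(\mathbf{r}_1, \mathbf{r}_2)}{r_1 r_2 + \mathbf{r}_1 \cdot \mathbf{r}_2} = \frac{1/2}{1/2 + 0} = 1 \quad (\alpha_1 = \pi/2), \\
+W(\mathbf{x}) &= \sum_{i=1}^{4} \left(\frac{1}{r_i} + \frac{1}{r_{i+1}}\right) t_i = 4 \cdot 2\sqrt{2} = 8\sqrt{2}, \\
+\varphi(\mathbf{x}) &= \frac{2}{W(\mathbf{x})} = \frac{\sqrt{2}}{8} \approx 0.177.
+\end{aligned}
+$$
+
+By symmetry all four $t_i$ are equal. The exact distance from the center to the boundary is 0.5, so the value 0.177 illustrates that φ is a smoothed approximation: the normalization guarantees agreement with the exact distance to first order near ∂Ω, not in the interior.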
+
+Figure 10: Approximate distance fields for polygonal domains. (a) square, (b) L-shaped, (c) nested squares, and (d) nested octagons.
+
+Figure 11: Approximate distance field for the polygonalized map of Bhutan. Surface plot is shown in (a) and the contour plot in (b). The polygon has 291 vertices.
+
+4.2. Approximate distance fields over curved domains
+
+Consider an open, bounded nonconvex domain Ω with boundary Γ = ∂Ω. Given a function g : Γ → R that is prescribed on a curved boundary, the transfinite mean value interpolant u : Ω → R is defined as [35, 83] 
+
+$$
+u(\mathbf{x}) = \frac{\int_{S_v} g(\mathbf{y}(\mathbf{x}, \mathbf{v}))\, K(\mathbf{x}, \mathbf{y})\, dS_v}{W(\mathbf{x})}, \qquad
+W(\mathbf{x}) = \int_{S_v} K(\mathbf{x}, \mathbf{v})\, dS_v, \qquad
+K(\mathbf{x}, \mathbf{v}) = \frac{1}{\|\mathbf{v} - \mathbf{x}\|}, \tag{15}
+$$
+
+where $\mathbf{x} \in \Omega \setminus S_v$ and $\mathbf{v} \in S_v$, $S_v$ is the unit circle that is centered at x, the ray from x that passes through v intersects the boundary Γ at y, and K(x, v) is a singular kernel function [35, 83]. Similar to the behavior of the inverse weight in mean value coordinates on polygons, the function φ(x) = 1/W(x) behaves like an approximate distance function to the boundary and its normal derivative on the boundary Γ is 1/2 [35]. Belyaev et al. [36] introduced Lp-distance fields (p ≥ 1), which approximate the exact distance function. These distance fields stem from a particular form of a singular double-layer potential, and hence the reference to them as generalized mean value potential fields. Consider the nonconvex domain shown in Fig. 12. The parametrization of the curved boundary Γ : [0, 1] → R is c(t), and its tangent is c′(t). For a nonconvex domain, a ray from x intersects the boundary at multiple points $\mathbf{c}(t_i) := \mathbf{y}_i(\mathbf{x}, \mathbf{v})$. On projecting the boundary curve onto the unit circle, an expression for φ(x) that is valid for convex as well as nonconvex domains is obtained in terms of the curve parameter t ∈ [0, 1] [83]:
+
+$$
+\varphi(\mathbf{x}) = \left( \frac{1}{W_p(\mathbf{x})} \right)^{1/p}, \qquad
+W_p(\mathbf{x}) = \int_0^1 \frac{(\mathbf{c}(t) - \mathbf{x}) \cdot \mathbf{c}'^{\perp}(t)}{\|\mathbf{c}(t) - \mathbf{x}\|^{2+p}}\, dt, \tag{16}
+$$
+
+where $\mathbf{c}'^{\perp}(t) := \mathrm{rot}(\mathbf{c}'(t))$ is obtained by rotating c′(t) through 90° in the clockwise direction. For x ∈ Ω, the integral in (16) is numerically integrated; if x ∈ ∂Ω (integral is singular), we set φ(x) = 0. In (16), $W_p(\mathbf{x})$ is the generalized mean value potential field, which is used to form the approximate distance function φ(x). Equation (16) is also applicable for polygons: on choosing p = 1 in (16), we recover the W(x) that appears in (13). The approximate distance fields (p = 1) for an elliptical disk, annulus, hypocycloid, and a propeller-shaped domain are shown in Fig. 13. The distance function is smooth in the interior of the domain, φ ∈ C∞(Ω), and it is $C^k$ on the boundary for a $C^k$ curve (derivative discontinuities occur at the vertices for a polygonal curve). Over curved two-dimensional domains, Dyken and Floater [35] assessed the approximation properties of the transfinite mean value interpolant as well as its use to solve the Poisson equation with web-splines [32], and Chin and Sukumar [84] have used it in verification tests of a cubature rule for numerical integration over curved regions.
+
+Figure 12: Nonconvex domain bounded by the curve c(t). The variables that appear in (15) and (16) are shown.
+
+Figure 13: Approximate distance fields on curved domains using φ(x) = 2/W1(x), with W1(x) given in (16). Surface and contour plots are shown for an (a) elliptical disk, (b) annulus, (c) hypocycloid, and (d) propeller [76].
+
+5. Imposing Boundary Conditions in Deep Neural Networks
+
+Let $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ denote the PINN trial function. We present the construction of $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ so that it exactly satisfies all essential and Robin (natural boundary condition is a particular case) boundary conditions. The trial function includes the contribution from the neural network approximation, $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$, where θ contains the unknown parameters of the network. We defer the presentation of $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$ to Section 6.
+
+5.1. Normalizing functions and solution structures 
+
+Let Ω ⊂ R2 be an open, bounded domain with boundary ∂Ω. Consider a smooth function u(x) : R2 → R and let φ(x) : R2 → R be an approximate distance function to ∂Ω that is normalized to the m-th order (see Section 2). Rvachev defined the normalizers to u of the m-th order with respect to φ(x) via the transformation [23, 25] 
+
+$$
+u^*(\mathbf{x}) = u(\mathbf{x} - \varphi\nabla\varphi), \tag{17}
+$$
+
+which leads to u∗(x) = u(x) on ∂Ω and $\partial^k u^*/\partial\nu^k = 0$ (k = 1, 2, . . . , m) on ∂Ω. Since we are treating Dirichlet and Robin boundary conditions in this paper, we proceed to show that (17) is normalized up to order 1. If u(x) is specified on ∂Ω, then since φ = 0 on ∂Ω it implies that u∗(x) = u(x) on ∂Ω, which establishes zeroth-order normalization. The proof for first-order normalization follows.
+
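+Proof. Let t := x − φ∇φ. Then, by the chain rule we have
+
+$$
+\nabla u^* = \nabla_t u^* \cdot \nabla \otimes \mathbf{t} = \nabla_t u^* \cdot \left[ \mathbf{I} - \varphi \nabla \otimes \nabla \varphi - \nabla \varphi \otimes \nabla \varphi \right],
+$$
+
+where ⊗ is the dyadic (tensor) product and $\nabla_t(\cdot)$ is the gradient with respect to t. On ∂Ω, φ = 0, t = x and ∇φ = ν, since φ is normalized to the first order. Hence, we can write
+
+$$
+\left. \frac{\partial u^*}{\partial \nu} \right|_{\partial\Omega} = \left[ \nabla u^* \cdot \boldsymbol{\nu} \right]_{\partial\Omega} = \left[ \nabla u^* \right]_{\partial\Omega} \cdot \left( \left[ \mathbf{I} - \varphi \nabla \otimes \boldsymbol{\nu} - \boldsymbol{\nu} \otimes \boldsymbol{\nu} \right] \cdot \boldsymbol{\nu} \right)_{\partial\Omega} = \left[ \nabla u^* \right]_{\partial\Omega} \cdot (\boldsymbol{\nu} - \boldsymbol{\nu}) = 0,
+$$
+
+which is the desired result.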
+
+Note that this result also holds if we consider the unit outward normal vector, n = −ν. By extension, if φ is normalized to the m-th order, one can establish that u∗ in (17) is normalized to the m-th order, i.e.,
+
+$$
+u^*(\mathbf{x}) = u(\mathbf{x}) \ \text{on } \partial\Omega, \qquad \frac{\partial^k u^*}{\partial \nu^k} = 0 \ \text{on } \partial\Omega \quad (k = 1, 2, \ldots, m). \tag{18}
+$$
+
+Let u and its higher order normal derivatives along ν be prescribed on ∂Ω, i.e.,
+
+$$
+u(\mathbf{x}) = u_0(\mathbf{x}) \ \text{on } \partial\Omega, \qquad \frac{\partial^k u}{\partial \nu^k} = u_k(\mathbf{x}) \ \text{on } \partial\Omega \quad (k = 1, 2, \ldots, m). \tag{19}
+$$
+
+Then, one can represent u in the vicinity of ∂Ω and in the direction (inward) normal to the boundary using a polynomial Taylor series expansion of u in terms of φ. Rvachev et al. [23, 24] referred to this as a generalized Taylor series expansion, which takes the form:
+
+$$
+u(\mathbf{x}) = u_0^*(\mathbf{x}) + \sum_{k=1}^{m} \frac{u_k^*(\mathbf{x})}{k!}\, \varphi^k(\mathbf{x}) + \varphi^{m+1}(\mathbf{x})\, \Psi(\mathbf{x}), \tag{20}
+$$
+
+where $u_k^*(\mathbf{x}) = u_k(\mathbf{x} - \varphi\nabla\varphi)$ and Ψ(x) is an unknown function (approximation) in the remainder term. This equation resembles the univariate Taylor series expansion about x = 0, where instead of evaluating derivatives that are scalar constants, $u_k^*(\mathbf{x})$ in (20) are scalar-valued functions. Equation (20) is the general form of the solution structure for u that Rvachev introduced [23, 24]. If a function $u \in C^k(\Omega)$, and its first m derivatives vanish on ∂Ω, then the solution structure $\varphi^{m+1}\Psi$ is sufficiently complete in the Hilbert space $H^k(\Omega)$ to approximate u and all its derivatives up to order k [24]. On using (20), the boundary conditions on ∂Ω that are given in (19) are exactly met. We now present the solution structure for three distinct sets of boundary conditions (Dirichlet, Neumann and Robin) that are imposed on ∂Ω.
+
+5.1.1. Solution structure for Dirichlet boundary condition 
+
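+If u = g is prescribed on ∂Ω, then using $u_0^*(\mathbf{x}) = g(\mathbf{x} - \varphi\nabla\varphi)$ and m = 0 in (20), we can write:
+
+$$
+\begin{aligned}
+u(\mathbf{x}) &= g(\mathbf{x} - \varphi\nabla\varphi) + \varphi(\mathbf{x})\Psi(\mathbf{x}) = g(\mathbf{x}) - \varphi(\mathbf{x}) \left.\frac{\partial g(\mathbf{x})}{\partial \nu}\right|_{\partial\Omega} + \varphi(\mathbf{x})\Psi(\mathbf{x}) = g(\mathbf{x}) + \varphi(\mathbf{x}) \left.\frac{\partial g(\mathbf{x})}{\partial n}\right|_{\partial\Omega} + \varphi(\mathbf{x})\Psi(\mathbf{x}) \\
+&= g(\mathbf{x}) + \varphi(\mathbf{x})\Psi(\mathbf{x}),
+\end{aligned} \tag{21}
+$$
+
+since n = −ν on ∂Ω, and a first order linearization of g(·) is used; in the last step the term φ ∂g/∂n is absorbed into the unknown function Ψ. Therefore, the solution structure is:
+
+$$
+u = g + \varphi\, \tilde{u}, \tag{22}
+$$
+
+where ũ is any suitable numerical approximation. So any trial function of the form given in (22) will exactly satisfy the essential boundary condition u = g on ∂Ω. In this paper, we use deep neural networks to construct ũ. For the homogeneous Dirichlet problem (g = 0) using Kantorovich’s method, Babuška et al. [19] provide an a priori error estimate in the H1 norm.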
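+
+To make (22) concrete, here is a small worked example that is not taken from the paper; the unit disk and the particular φ are assumptions chosen for simplicity. For Ω = {x : ‖x‖ < 1}, a simple distance function that is normalized to first order is
+
+$$
+\varphi(\mathbf{x}) = \frac{1 - \|\mathbf{x}\|^2}{2}, \qquad \nabla\varphi = -\mathbf{x}, \qquad \left.\nabla\varphi\right|_{\|\mathbf{x}\|=1} = -\mathbf{n} = \boldsymbol{\nu},
+$$
+
+so φ vanishes on the unit circle and has unit derivative along the inward normal there. The trial function in (22) then reads
+
+$$
+u(\mathbf{x}) = g(\mathbf{x}) + \frac{1 - \|\mathbf{x}\|^2}{2}\, \tilde{u}(\mathbf{x}), \qquad \left. u \right|_{\|\mathbf{x}\|=1} = g,
+$$
+
+which satisfies the Dirichlet condition for any bounded ũ, including an untrained neural network.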
+
+5.1.2. Solution structure for Neumann boundary condition 
+
+Since Neumann and higher order boundary conditions for PDEs are imposed in the direction of n, the unit outward normal vector to ∂Ω, we consider solution structures that are defined with respect to n. This is a departure from the literature on the R-function method to solve PDEs, but is aligned with how boundary-value problems are posed. For first order derivatives, linearizing u∗(x) : = u(x − φ∇φ) in the neighborhood of the boundary ( φ = 0) amounts to subtracting the variation in the normal direction ν, which leads to [23] 
+
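+$$
+\begin{aligned}
+u^*(\mathbf{x}) &= u(\mathbf{x}) - \left[ \varphi(\mathbf{x}) \nabla u(\mathbf{x}) \cdot \boldsymbol{\nu} \right]_{\partial\Omega} + \varphi^2(\mathbf{x})\Psi(\mathbf{x}) = u(\mathbf{x}) - \left[ \varphi(\mathbf{x}) \nabla\varphi(\mathbf{x}) \cdot \nabla u(\mathbf{x}) \right]_{\partial\Omega} + \varphi^2(\mathbf{x})\Psi(\mathbf{x}) \\
+&= (1 + \varphi D_1^{\varphi})(u) + \varphi^2 \Psi, \qquad D_1^{\varphi}(\cdot) := \left[ -\nabla\varphi \cdot \nabla(\cdot) \right]_{\partial\Omega} = \left. \frac{\partial(\cdot)}{\partial n} \right|_{\partial\Omega},
+\end{aligned} \tag{23}
+$$
+
+where ν = −n = ∇φ on ∂Ω and $D_1^{\varphi}(\cdot)$ is a differential operator that acts in the outward normal direction to the boundary. If ∂u/∂n = h is prescribed on ∂Ω, then using (20) and (23), we can write
+
+$$
+\begin{aligned}
+u(\mathbf{x}) &= u_0(\mathbf{x} - \varphi(\mathbf{x})\nabla\varphi(\mathbf{x})) + \varphi(\mathbf{x})\, u_1(\mathbf{x} - \varphi(\mathbf{x})\nabla\varphi(\mathbf{x})) + \varphi^2(\mathbf{x})\Psi_1(\mathbf{x}) \\
+&= [1 + \varphi D_1^{\varphi}](u_0) + \varphi u_1 + \varphi^2 \Psi(\mathbf{x}) = [1 + \varphi D_1^{\varphi}](u_0) - \varphi h + \varphi^2 \Psi,
+\end{aligned}
+$$
+
+since ∂u/∂ν = u1 = −h on ∂Ω, and therefore the most general solution structure can be written as:
+
+$$
+u = [1 + \varphi D_1^{\varphi}](\tilde{u}_1) - \varphi h + \varphi^2 \tilde{u}_2, \tag{24}
+$$
+
+where $\tilde{u}_1$ and $\tilde{u}_2$ are arbitrary approximation functions.
+
+5.1.3. Solution structure for Robin boundary condition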
+
+If the Robin boundary condition, ∂u/∂n + cu = h, is prescribed on ∂Ω (c := c(x), h := h(x)), then following similar steps to that taken to obtain (22) and (24), we can write the most general solution structure for this boundary condition as [23]:
+
+$$
+u = \left[ 1 + \varphi (c + D_1^{\varphi}) \right](\tilde{u}_1) - \varphi h + \varphi^2 \tilde{u}_2. \tag{25}
+$$
+
+It is readily verified that the ansatz in (25) satisfies the boundary condition ∂u/∂n + cu = h on ∂Ω.
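+
+The verification can be spelled out as follows. This short derivation is added for illustration and assumes, as in the text, that φ = 0 and ∇φ = ν = −n on ∂Ω, so that ∂φ/∂n = −1 and $D_1^{\varphi}(\tilde{u}_1) = \partial\tilde{u}_1/\partial n$ there. Every term in (25) that still carries a factor φ after differentiation drops out on the boundary, which leaves
+
+$$
+\begin{aligned}
+\left. u \right|_{\partial\Omega} &= \tilde{u}_1, \\
+\left. \frac{\partial u}{\partial n} \right|_{\partial\Omega} &= \frac{\partial \tilde{u}_1}{\partial n} + \frac{\partial\varphi}{\partial n}\left( c\,\tilde{u}_1 + \frac{\partial \tilde{u}_1}{\partial n} - h \right) = h - c\,\tilde{u}_1, \\
+\left. \left( \frac{\partial u}{\partial n} + c\,u \right) \right|_{\partial\Omega} &= h - c\,\tilde{u}_1 + c\,\tilde{u}_1 = h.
+\end{aligned}
+$$
+
+Setting c = 0 recovers the Neumann structure in (24).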
+
+5.2. Imposing inhomogeneous essential boundary conditions 
+
+Let us extend our analysis to the case when different inhomogeneous essential boundary conditions are imposed on distinct subsets of ∂Ω. Now, let the boundary $\partial\Omega := \Gamma = \cup_{i=1}^{N} \Gamma_i$. The inhomogeneous essential boundary condition $u = g_i$ is imposed on $\Gamma_i$ (i = 1, 2, . . . , M), and on $\Gamma_i$ (i = M + 1, M + 2, . . . , N) we assume natural boundary conditions are imposed through the potential energy functional in the variational principle. Let $\varphi_i$ be the ADF that is associated with $\Gamma_i$ (i = 1, 2, . . . , M), and let φ be the approximate distance field that is composed either via R-equivalence using $\varphi_1 \sim \varphi_2 \sim \cdots \sim \varphi_M$ in (10) or using the mean value potential field, W(x), given in (14) and (16). Transfinite interpolation is the generalization of scattered data interpolation to interpolation of functions over curves and surfaces. On using the singular inverse-distance based Shepard weight function [85], we can write the transfinite interpolant as [23, 25]
+
+$$
+g(\mathbf{x}) = \sum_{i=1}^{M} w_i(\mathbf{x})\, g_i(\mathbf{x}), \qquad
+w_i(\mathbf{x}) = \frac{\varphi_i^{-\mu_i}}{\sum_{j=1}^{M} \varphi_j^{-\mu_j}} = \frac{\prod_{j=1;\, j \neq i}^{M} \varphi_j^{\mu_j}}{\sum_{k=1}^{M} \prod_{j=1;\, j \neq k}^{M} \varphi_j^{\mu_j}}, \tag{26}
+$$
+
+where the weights $w_i$ form a partition-of-unity, and (26) interpolates $g_i$ on the set $\Gamma_i$. In (26), $\mu_i \geq 1$ is a constant that controls the nature of interpolation that accrues. For $\mu_i = 1$, the function $g_i$ is interpolated on $\Gamma_i$, whereas if $\mu_i = 2$, both $g_i$ and $\partial g_i/\partial n$ are interpolated on $\Gamma_i$. Now, on using the solution structure for the Dirichlet problem given in (22) and following the work of Rvachev et al. [23, 24], we can write the trial function in the deep Ritz method as:
+
+$$
+\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = g(\mathbf{x}) + \varphi(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta), \tag{27}
+$$
+
+where φ can also be replaced by the product $\prod_{i=1}^{M} \varphi_i$. Since $g(\mathbf{x}) = g_i(\mathbf{x})$ on $\Gamma_i$ and φ(x) vanishes on $\cup_{i=1}^{M} \Gamma_i$, kinematic admissibility of (27) is verified.
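+
+A one-dimensional instance makes (26) and (27) easy to inspect. The following is an illustrative sketch, not an example from the paper: take Ω = (0, 1) with Γ1 = {0}, Γ2 = {1}, constant boundary values g1 and g2, exact distances φ1 = x and φ2 = 1 − x, and μ1 = μ2 = 1. Then
+
+$$
+w_1 = \frac{\varphi_2}{\varphi_1 + \varphi_2} = 1 - x, \qquad w_2 = \frac{\varphi_1}{\varphi_1 + \varphi_2} = x, \qquad g(x) = g_1 (1 - x) + g_2\, x,
+$$
+
+and, taking the product $\varphi = \varphi_1 \varphi_2$, the trial function becomes
+
+$$
+\tilde{u}^{bc}_{nn}(x; \theta) = g_1 (1 - x) + g_2\, x + x (1 - x)\, \tilde{u}^{R}_{nn}(x; \theta),
+$$
+
+which returns g1 at x = 0 and g2 at x = 1 regardless of θ.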
+
+5.3. Imposing inhomogeneous essential and Robin boundary conditions 
+
+Let us consider the case that the boundary ∂Ω := Γ = Γ1 ∪ Γ2 with Γ1 ∩ Γ2 = ∅. The boundary conditions on Γ1 and Γ2 are:
+
+$$
+u = g \ \text{on } \Gamma_1, \qquad \frac{\partial u}{\partial n} + c\,u = h \ \text{on } \Gamma_2. \tag{28}
+$$
+
+Let φ1(x) and φ2(x) be the approximate distance functions to the boundaries Γ1 and Γ2, respectively. We use the R-equivalence (m = 1) relation in (9) to form
+
+$$
+\varphi(\mathbf{x}) = \frac{\varphi_1(\mathbf{x})\, \varphi_2(\mathbf{x})}{\varphi_1(\mathbf{x}) + \varphi_2(\mathbf{x})}, \tag{29}
+$$
+
+which is the ADF to the boundary Γ. For the case when the boundary is partitioned into two disjoint sets, we consider two approaches to form a trial function. The first approach uses superposition of two solution structures and has a simple form. The second method, which is based on transfinite interpolation, is applicable in general when there are multiple boundaries on which essential and Robin boundary conditions are imposed. In this paper, we apply the first approach to solve a one-dimensional problem with mixed (Dirichlet and Neumann) boundary conditions, and adopt the second approach to solve a problem with Dirichlet and Robin boundary conditions in two dimensions.
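+
+It is worth checking that (29) inherits the properties of its two ingredients. The short derivation below is added for illustration and uses only the quotient rule; it assumes φ2 > 0 on Γ1 and φ1 > 0 on Γ2, i.e., points away from where the two boundary sets meet.
+
+$$
+\nabla\varphi = \frac{\varphi_2^2\, \nabla\varphi_1 + \varphi_1^2\, \nabla\varphi_2}{(\varphi_1 + \varphi_2)^2}, \qquad
+\left.\varphi\right|_{\Gamma_1} = 0, \quad \left.\nabla\varphi\right|_{\Gamma_1} = \nabla\varphi_1, \qquad
+\left.\varphi\right|_{\Gamma_2} = 0, \quad \left.\nabla\varphi\right|_{\Gamma_2} = \nabla\varphi_2.
+$$
+
+Hence φ vanishes on all of Γ and, if φ1 and φ2 are normalized to first order, so is φ. In one dimension with φ1 = x and φ2 = 1 − x, (29) gives φ = x(1 − x), whose derivative is +1 at x = 0 and −1 at x = 1, that is, a unit slope along the inward direction at both ends.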
+
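+Approach I: We form solution structures u1 and u2 such that
+
+$$
+u_1 = g \ \text{on } \Gamma_1, \quad \frac{\partial u_1}{\partial n} + u_1 = 0 \ \text{on } \Gamma_2, \qquad
+u_2 = 0 \ \text{on } \Gamma_1, \quad \frac{\partial u_2}{\partial n} + u_2 = h \ \text{on } \Gamma_2, \tag{30}
+$$
+
+and therefore the desired trial function is: u(x) = u1(x) + u2(x). Now, on using (22), we know that the function $g + \varphi_1 \tilde{u}_1$ satisfies the essential boundary condition on Γ1 but does not meet the Robin boundary condition on Γ2. To satisfy the Robin boundary condition on Γ2, we normalize $u_1 + \varphi \tilde{u}_1$ to the first order using h = 0 in (25) to obtain
+
+$$
+u_1 = \left[ 1 + \varphi (c + D_1^{\varphi_2}) \right](g + \varphi_1 \tilde{u}_1) = \left[ 1 + \varphi D_1^{\varphi_2} \right](g + \varphi_1 \tilde{u}_1) + c \varphi g + c \varphi_1 \varphi \tilde{u}_1, \tag{31a}
+$$
+
+where
+
+$$
+D_1^{\varphi_2}(\cdot) = \left[ -\nabla\varphi_2 \cdot \nabla(\cdot) \right]_{\Gamma_2} = \left. \frac{\partial(\cdot)}{\partial n} \right|_{\Gamma_2} \tag{31b}
+$$
+
+is the differential operator that acts in the outward normal direction on the boundary Γ2. Similarly, using (25), the minimal structure for u2 is:
+
+$$
+u_2 = \varphi (\varphi_2 \tilde{u}_2 - h). \tag{32}
+$$
+
+Since φ = 0 on Γ and φ2 = 0 on Γ2, the conditions on u2 in (30) are satisfied. On choosing $\tilde{u}_1 = \tilde{u}_2 := \tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$ in (31) and (32) and adding them up, the ansatz $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ is:
+
+$$
+\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = \varphi_1(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta) + \varphi(\mathbf{x}) \left[ \left\{ \varphi_2(\mathbf{x}) + c(\mathbf{x}) \varphi_1(\mathbf{x}) \right\} \tilde{u}^{R}_{nn}(\mathbf{x}; \theta) + D_1^{\varphi_2}\!\left( \varphi_1(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta) \right) + D_1^{\varphi_2}(g(\mathbf{x})) + c(\mathbf{x}) g(\mathbf{x}) - h(\mathbf{x}) \right] + g(\mathbf{x}), \tag{33}
+$$
+
+where φ(x) is given in (29). For mixed (Dirichlet and Robin) boundary conditions, the form of the trial function in (33) appears in Rvachev and Sheiko [23].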
+
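+Approach II: On using (22) and (25), we select the boundary functions u1 and u2 on Γ1 and Γ2, respectively, as:
+
+$$
+u_1 = g, \tag{34a}
+$$
+
+$$
+u_2 = \left[ 1 + \varphi_2 (c + D_1^{\varphi_2}) \right](\tilde{u}) - \varphi_2 h, \tag{34b}
+$$
+
+with $\varphi_1 \varphi_2^2 \tilde{u}$ as the composite remainder term. From (26), we recall the transfinite interpolant, where we now choose μ1 = 1 and μ2 = 2. On carrying out a few algebraic simplifications and using (34) with $\tilde{u} := \tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$, we can write the trial function $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ that exactly imposes the mixed boundary conditions as:
+
+$$
+\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = w_1(\mathbf{x})\, u_2(\mathbf{x}; \theta) + w_2(\mathbf{x})\, u_1(\mathbf{x}) + \varphi_1(\mathbf{x}) \varphi_2^2(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta), \tag{35a}
+$$
+
+where
+
+$$
+w_1(\mathbf{x}) = \frac{\varphi_1(\mathbf{x})}{\varphi_1(\mathbf{x}) + \varphi_2^2(\mathbf{x})}, \qquad w_2(\mathbf{x}) = \frac{\varphi_2^2(\mathbf{x})}{\varphi_1(\mathbf{x}) + \varphi_2^2(\mathbf{x})}, \tag{35b}
+$$
+
+and
+
+$$
+u_1(\mathbf{x}) = g(\mathbf{x}), \qquad u_2(\mathbf{x}; \theta) = \left[ 1 + \varphi_2(\mathbf{x}) \left( c(\mathbf{x}) + D_1^{\varphi_2} \right) \right]\!\left( \tilde{u}^{R}_{nn}(\mathbf{x}; \theta) \right) - \varphi_2(\mathbf{x}) h(\mathbf{x}). \tag{35c}
+$$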
+
+6. Approximation of Trial Functions in a Deep Neural Network 
+
+In this paper, we exclusively use the densely connected neural network architecture, also known as the multi-layer perceptron (MLP), which has its origin in the early works of Rosenblatt [86]. MLPs consist of multiple layers of neurons, where each neuron has the task of converting its input to an output by generally passing it through a nonlinear function called activation. MLPs are characterized by an architecture where neurons in a given layer are connected densely to the neurons in the neighboring layers (Fig. 14). We note in passing, however, that the latest revolution in deep learning began with a different neural network architecture, convolutional neural networks (CNN), applied to image classification tasks [87]. Independent of research in PINNs, some of these modern architectures have also been applied to mechanics problems [88]. In keeping with the broadly accepted definition, we consider any deep network to have two or more hidden layers. In a standard collocation or Ritz method, the trial function is expanded as a linear combination of known basis functions. The point of departure in using deep neural networks is that the ansatz herein is represented by a nonlinear map that consists of unknown parameters. These parameters are obtained via the solution of a minimization problem, which in general, but not necessarily, is a least squares optimization problem. Once the parameters are determined, one obtains the numerical solution to the boundary-value problem.
+
+6.1. Feedforward neural network
+
+Given a point $\mathbf{x} \in \mathbb{R}^d$, we use a multilayer feedforward deep neural network to construct $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$, which is then used to build $\tilde{u}^{bc}_{nn}(\mathbf{x})$, the approximation to $u(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$. The layers in between the input and output layer are known as hidden layers. Each hidden layer consists of neurons (hidden units), and each neuron in a hidden layer takes its input from the neurons in the preceding layer and computes its own activation. A network diagram of neural networks with one, two, and four hidden layers is shown in Fig. 14. In this paper, we consider the following boundary-value problem:
+
+$$
+\mathcal{L} u(\mathbf{x}) = f(\mathbf{x}) \ \text{in } \Omega \subset \mathbb{R}^d, \tag{36a}
+$$
+
+$$
+\mathcal{B}_u u(\mathbf{x}) = g(\mathbf{x}) \ \text{on } \Gamma_u, \tag{36b}
+$$
+
+$$
+\mathcal{B}_n u(\mathbf{x}) = h(\mathbf{x}) \ \text{on } \Gamma_n, \tag{36c}
+$$
+
+where $\mathcal{L}$ is in general a differential operator plus the identity and d is the spatial dimension. In (36), Ω is an open bounded domain with boundary ∂Ω = Γu ∪ Γn and Γu ∩ Γn = ∅. Equation (36b) represents essential boundary conditions for second- and fourth-order problems, and (36c) imposes natural and higher order boundary conditions. For a second-order problem, (36c) is a Robin boundary condition. As noted in Section 5, we form the trial function to approximate u(x) as a combination of terms that involve the approximate distance functions ($\varphi_k(\mathbf{x})$, φ(x)) and the neural network approximation. Consider a neural network that consists of L hidden layers with $N_\ell$ neurons in the hidden layer ℓ and activation function σ : R → R. The size of the neural network is: $N = \sum_{\ell=1}^{L} N_\ell$. Let $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$ be the PINN approximation, where θ := {W, b} is the unknown parameter vector, with weights $\mathbf{W}_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and biases $\mathbf{b}_\ell \in \mathbb{R}^{N_\ell}$. We write $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$ via the composition of $T^{(\ell)}$ (ℓ = 1, 2, . . . , L) and a linear map G as:
+
+$$
+\tilde{u}^{R}_{nn}(\mathbf{x}; \theta) = G \circ T^{(L)} \circ T^{(L-1)} \circ \cdots \circ T^{(1)}(\mathbf{x}), \tag{37}
+$$
+
+where $G : \mathbb{R}^{N_L} \to \mathbb{R}$ is the linear mapping for the output layer and in each hidden layer (ℓ = 1, 2, . . . , L), the nonlinear mapping is:
+
+$$
+T^{(\ell)}(\mathbf{z}) = \sigma(\mathbf{W}_\ell \cdot \mathbf{z} + \mathbf{b}_\ell), \tag{38}
+$$
+
+where $\mathbf{z} \in \mathbb{R}^{N_{\ell-1}}$. For a neural network with activation function σ and a single hidden layer that consists of N neurons, the PINN approximation is:
+
+$$
+\tilde{u}^{R}_{nn}(\mathbf{x}; \theta) = \sum_{i=1}^{N} c_i\, \sigma(\mathbf{w}_i \cdot \mathbf{x} + b_i), \tag{39}
+$$
+
+where $\mathbf{w}_i \in \mathbb{R}^d$ and $b_i, c_i \in \mathbb{R}$. It is known that a multilayer neural network is sufficiently rich to be able to approximate any L2 function to arbitrary accuracy [66, 67, 89]. However, realization of this in practice hinges on the width and depth of the network, choice of the activation function and the computational algorithm to solve the optimization problem.
+
+6.2. Backpropagation algorithm 
+
+Determination of the optimal parameters of the network is done through an iterative optimization process called backpropagation, and a popular algorithm for backpropagation is stochastic gradient descent, which is the stochastic version of the gradient descent algorithm. An important step in this procedure is the efficient computation of the gradient of the loss function using automatic differentiation. In this paper, we exclusively use the Adam backpropagation algorithm whose details are given in [91]. Adam is an extension of the stochastic gradient descent algorithm and differs from it primarily in its implementation of per-parameter learning rates that are continuously tweaked during learning through both the first and second moments of the gradients.
+
+7. Formulations 
+
+We now present the formulations for deep collocation and deep Ritz for second- and fourth-order problems. In the collocation approach, all boundary conditions are exactly satisfied; for the Ritz method, the essential boundary conditions are met.
+
+Figure 14: Deep neural network with (a) one, (b) two, and (c) four hidden layers [90]. Input layer is a point $\mathbf{x} \in \mathbb{R}^2$, and the output layer is the PINN approximation, $\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$.
+
+7.1. Deep collocation 
+
+Let us consider a second-order boundary-value problem with mixed boundary conditions. We require the trial function given in (35) (meets all boundary conditions) to also satisfy the governing equation in (36a) at $N_I$ interior collocation points. We label these collocation points as $\{\mathbf{x}_k\}_{k=1}^{N_I}$. When substituted in (36a), this defines a residual error at each point $\mathbf{x}_k$ and to determine the parameter θ, we minimize the mean squared residual error:
+
+$$
+\theta^* = \arg\min_{\theta} L^{bc}_{nn}(\theta), \qquad
+L^{bc}_{nn}(\theta) := \left\| \mathcal{L}\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) - f(\mathbf{x}) \right\|^2_{\Omega, N_I} = \frac{1}{N_I} \sum_{k=1}^{N_I} \left[ \mathcal{L}\tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta) - f(\mathbf{x}_k) \right]^2, \tag{40}
+$$
+
+where $L^{bc}_{nn}(\theta)$ is known as the loss function, and $\|\cdot\|_{\Omega, N_I}$ denotes the mean discrete L2 norm of its argument over the domain Ω that is discretized using $N_I$ collocation points. The Adam algorithm [91] is used to solve (40). When a standard PINN trial function, $\tilde{u}_{nn}(\mathbf{x}; \theta)$, is used, additional residual error contributions from the boundary conditions are present in the loss function. If only essential boundary conditions are imposed, with u = g on ∂Ω, then the solution for the parameter θ is given by
+
+$$
+\theta^* = \arg\min_{\theta} L_{nn}(\theta), \qquad
+L_{nn}(\theta) := \left\| \mathcal{L}\tilde{u}_{nn}(\mathbf{x}; \theta) - f(\mathbf{x}) \right\|^2_{\Omega, N_I} + \left\| \tilde{u}_{nn}(\mathbf{x}; \theta) - g(\mathbf{x}) \right\|^2_{\partial\Omega, N_B}, \tag{41}
+$$
+
+where $\|\cdot\|_{\partial\Omega, N_B}$ is the mean discrete L2 norm of its argument over the boundary ∂Ω, and $N_B$ is the number of collocation points on ∂Ω. As noted in Section 1, the first and second terms in $L_{nn}(\theta)$ are referred to as the interior (PDE) loss and the boundary loss, respectively.
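+
+To see what (40) looks like once the boundary conditions are built in, consider a one-dimensional illustration that is not part of the paper: −u″ = f on (0, 1) with u(0) = u(1) = 0, φ(x) = x(1 − x), and a scalar network output abbreviated here as v(x; θ). These choices, including the symbol v, are assumptions made only to keep the algebra short. With the trial function $\tilde{u}^{bc}_{nn} = \varphi\, v$ from (22), the product rule gives
+
+$$
+-\frac{d^2}{dx^2}\left[ x(1-x)\, v \right] = 2v - 2(1 - 2x)\, v' - x(1-x)\, v'',
+$$
+
+so that the collocation loss contains only the interior residual:
+
+$$
+L^{bc}_{nn}(\theta) = \frac{1}{N_I} \sum_{k=1}^{N_I} \left[ 2v(x_k; \theta) - 2(1 - 2x_k)\, v'(x_k; \theta) - x_k(1 - x_k)\, v''(x_k; \theta) - f(x_k) \right]^2 .
+$$
+
+In contrast to (41), no boundary term and no boundary weighting appear, because the trial function vanishes at x = 0 and x = 1 for every θ. In practice v′ and v″ would be obtained by automatic differentiation.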
+
+7.2. Deep Ritz 
+
+E and Yu [9] introduced the deep Ritz method to solve low- and high-order boundary-value problems that have a variational structure. Samaniego et al. [92] have applied the Ritz approach to solve problems in computational solid mechanics. We consider second-order (Poisson) and fourth-order (plate bending) boundary-value problems. Essential and mixed boundary conditions are considered for the Poisson problem and clamped boundary conditions are imposed for the plate bending problem. We use a variational principle (minimization of the potential energy functional) to solve both problems. A trial function, ˜ ubc nn (x; θ), from a finite-dimensional space is used in the variational principle, which forms the basis of the deep Ritz method. 
+
+7.2.1. Second-order problems 
+
+Referring to the model boundary-value problem in (36), we first consider a Poisson problem with Dirichlet boundary conditions:
+
+$$
+-\nabla^2 u = f \ \text{in } \Omega, \tag{42a}
+$$
+
+$$
+u = g \ \text{on } \partial\Omega. \tag{42b}
+$$
+
+The variational principle for this problem is:
+
+$$
+\min_{u \in S} \left[ \Pi[u] := \underbrace{\tfrac{1}{2}\, a(u, u)}_{W_{\mathrm{int}}[u]} - \underbrace{\ell(u)}_{W_{\mathrm{ext}}[u]} \right], \qquad
+S = \left\{ u : u \in H^1(\Omega),\ u = g \ \text{on } \partial\Omega \right\}, \tag{43a}
+$$
+
+where
+
+$$
+a(u, w) = \int_{\Omega} \nabla u \cdot \nabla w \, d\mathbf{x}, \qquad \ell(w) = \int_{\Omega} f\, w \, d\mathbf{x} \tag{43b}
+$$
+
+are a symmetric bilinear functional and a linear functional, respectively, and u and w are the trial and test functions, respectively. In (43a), $W_{\mathrm{int}}[u]$ is the internal work (strain energy) and $W_{\mathrm{ext}}[u]$ is the external work, and $H^k(\Omega)$ is the Sobolev space that consists of functions that have square integrable derivatives up to order k in Ω.
+
+As a second problem, we consider a Poisson problem with mixed (Dirichlet and Robin) boundary conditions:
+
+$$
+-\nabla^2 u = f \ \text{in } \Omega, \tag{44a}
+$$
+
+$$
+u = g \ \text{on } \Gamma_u, \qquad \frac{\partial u}{\partial n} + c\,u = h \ \text{on } \Gamma_n, \tag{44b}
+$$
+
+where n is the unit outward normal on Γn, c := c(x) and h := h(x) are boundary data, and Γ = ∂Ω with Γ = Γu ∪ Γn and Γu ∩ Γn = ∅. The variational principle for this problem is:
+
+$$
+\min_{u \in S} \left[ \Pi[u] := \tfrac{1}{2}\, a(u, u) - \ell(u) \right], \qquad
+S = \left\{ u : u \in H^1(\Omega),\ u = g \ \text{on } \partial\Omega \right\}, \tag{45a}
+$$
+
+where
+
+$$
+a(u, w) = \int_{\Omega} \nabla u \cdot \nabla w \, d\mathbf{x} + \int_{\Gamma_n} c\, u\, w \, dS, \qquad
+\ell(w) = \int_{\Omega} f\, w \, d\mathbf{x} + \int_{\Gamma_n} h\, w \, dS. \tag{45b}
+$$
+
+As indicated in (27), we choose a kinematically admissible trial function so that the essential boundary condition is satisfied. On substituting the PINN trial function in (43) or (45), the finite-dimensional minimization problem becomes
+
+$$
+\min_{\theta} \left[ \Pi[\tilde{u}^{bc}_{nn}] := \tfrac{1}{2}\, a\!\left( \tilde{u}^{bc}_{nn}(\mathbf{x}; \theta), \tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) \right) - \ell\!\left( \tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) \right) \right]. \tag{46}
+$$
+
+To compute the potential energy functional, we use a Monte Carlo integration rule with points that are distributed (randomly or quasi-uniformly) in the domain with a constant weight that is attached to each point. To obtain the unknown parameters, we solve a discrete nonlinear minimization problem that is posed as:
+
+$$
+\theta^* = \arg\min_{\theta} L^{bc}_{nn}(\theta), \qquad
+L^{bc}_{nn}(\theta) = \frac{1}{N_I} \sum_{k=1}^{N_I} \left[ \tfrac{1}{2}\, a\!\left( \tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta), \tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta) \right) - \ell\!\left( \tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta) \right) \right], \tag{47}
+$$
+
+where $L^{bc}_{nn}(\theta)$ is the loss function, and a(·, ·) and ℓ(·) are given in (43b) and (45b) for the Poisson problems with Dirichlet and mixed boundary conditions, respectively. In general, the stochastic gradient descent algorithm or a variant of it is used to solve (47).
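+
+For the Dirichlet problem (42), the pointwise contribution in (47) can be written out explicitly. The expression below is one reading of (47), offered as an illustration: it assumes that a(·, ·) and ℓ(·) evaluated at a sample point $\mathbf{x}_k$ mean the integrands in (43b) evaluated at that point.
+
+$$
+L^{bc}_{nn}(\theta) = \frac{1}{N_I} \sum_{k=1}^{N_I} \left[ \frac{1}{2} \left\| \nabla \tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta) \right\|^2 - f(\mathbf{x}_k)\, \tilde{u}^{bc}_{nn}(\mathbf{x}_k; \theta) \right].
+$$
+
+For points sampled uniformly in Ω, multiplying this sum by the area |Ω| gives the usual Monte Carlo estimate of Π. The factor |Ω| is a positive constant, so dropping it does not change the minimizer θ∗. Note also that only first derivatives of the trial function appear, whereas the collocation loss (40) for the same problem requires second derivatives.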
+
+7.2.2. Fourth-order problem 
+
+We consider the fourth-order problem of Kirchhoff plate bending with clamped boundary conditions. The strong form of the boundary-value problem is:
+
+$$
+\nabla^4 u = f \ \text{in } \Omega \subset \mathbb{R}^2, \tag{48a}
+$$
+
+$$
+u = 0 \ \text{on } \partial\Omega, \qquad u_{,n} := \frac{\partial u}{\partial n} = 0 \ \text{on } \partial\Omega, \tag{48b}
+$$
+
+where u := u(x, y) is the out-of-plane plate deflection, f is the transverse load per unit area, and n is the unit outward normal on the boundary ∂Ω. The flexural rigidity of the plate is assumed to be unity. The variational principle that is associated with the strong form in (48) is:
+
+$$
+\min_{u \in S} \left[ \Pi[u] := \tfrac{1}{2}\, a(u, u) - \ell(u) \right], \qquad
+S = \left\{ u : u \in H^2(\Omega),\ u = 0 \ \text{on } \partial\Omega,\ u_{,n} = 0 \ \text{on } \partial\Omega \right\}, \tag{49a}
+$$
+
+where
+
+$$
+a(u, w) = \int_{\Omega} (\nabla^2 u)(\nabla^2 w) \, d\mathbf{x}, \qquad \ell(w) = \int_{\Omega} f\, w \, d\mathbf{x}, \tag{49b}
+$$
+
+and now both essential boundary conditions in (48b) must be met to satisfy kinematic admissibility. To meet this objective, the solution structure for the PINN trial function is chosen as:
+
+$$
+\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = [\varphi(\mathbf{x})]^2\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta), \tag{50}
+$$
+
+where φ(x) is an approximate distance function to the boundary ∂Ω. On substituting the trial function in (49), the finite-dimensional minimization problem is of the same form as (43), where a(·, ·) and ℓ(·) are now given by (49b). Similar to the deep Ritz approach for the Poisson problem, we use a Monte Carlo integration rule with equally-weighted points. Following the same steps, the loss function is of the same form as in (47) with the bilinear and linear functionals defined in (49b).
+
+8. Numerical Implementation 
+
+Numerical solutions using deep neural networks are presented for one-, two- and four-dimensional boundary-value problems. Both second- and fourth-order problems are considered. The trial function, $\tilde{u}^{bc}_{nn}(x; \theta)$, contains $\tilde{u}^{R}_{nn}(x; \theta)$, the approximation that is composed by the neural network. The trial function in the standard PINN [6] is denoted by $\tilde{u}_{nn}(x; \theta)$. Likewise, $\mathcal{L}^{bc}_{nn}(\theta)$ is used to denote the loss function in our approach and $\mathcal{L}_{nn}(\theta)$ is that for standard PINN. We refer to the numerical solution that is obtained (after training) by our approach as $\tilde{u}^{bc}_{nn}(x)$ and that obtained (after training) using standard PINN [6] as $\tilde{u}_{nn}(x)$. Deep collocation and deep Ritz methods are used to solve boundary-value problems. For the Ritz method, either the ReLU or the cubic ReLU activation function is used, with the problem that is solved in Section 9.1.7 being the sole exception where the Gaussian activation function is used. Unless otherwise stated, R-equivalence (REQ) composition with $m = 1$ and mean value potential (MVP) with $p = 1$ are used to form approximate distance functions in $\tilde{u}^{bc}_{nn}(x; \theta)$. For all problems that are considered in this paper, whenever REQ is indicated, we employ (10) to form $\phi(x)$; when MVP is mentioned, the expression for $\phi$ given in (14) or (16) applies. For collocation, all boundary conditions are exactly satisfied, and essential boundary conditions are exactly met in the Ritz method, which ensures that the trial function is kinematically admissible.
+
+All collocation points are considered in a single batch in the network; points are not sorted in bins and passed in batches, which in general is a more efficient approach. The formulation described in Section 7 has been implemented using Google’s JAX library in Python [93], which can automatically differentiate native Python and NumPy functions. As an example, the code listing below sets up the calculation for the approximate distance function over an arbitrary polygon through the R-equivalence operation in (10).
+
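+    import jax
+
+    def dist(x1, y1, x2, y2):
+        # Euclidean distance between (x1, y1) and (x2, y2)
+        return jax.numpy.sqrt((x2 - x1)**2 + (y2 - y1)**2)
+
+    def linseg(x, y, x1, y1, x2, y2):
+        # Approximate distance from (x, y) to the segment (x1, y1)-(x2, y2)
+        L = dist(x1, y1, x2, y2)
+        xc = (x1 + x2) / 2.
+        yc = (y1 + y2) / 2.
+        # f: signed distance to the infinite line through the segment
+        f = (1 / L) * ((x - x1) * (y2 - y1) - (y - y1) * (x2 - x1))
+        # t: trimming function, positive inside the circle whose diameter is the segment
+        t = (1 / L) * ((L / 2.)**2 - dist(x, y, xc, yc)**2)
+        varphi = jax.numpy.sqrt(t**2 + f**4)
+        phi = jax.numpy.sqrt(f**2 + (1 / 4.) * (varphi - t)**2)
+        return phi
+
+    def phi(x, y, segments):
+        # R-equivalence composition of the per-segment distance functions
+        m = 1.
+        R = 0.
+        for i in range(len(segments[:, 0])):
+            phi_i = linseg(x, y, segments[i, 0], segments[i, 1], segments[i, 2], segments[i, 3])
+            R = R + 1. / phi_i**m
+        R = 1 / R**(1 / m)
+        return R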
+
+Here segments is a NumPy array, which contains the coordinates of the line segments that make up the polygon and phi(x,y,segments) returns an approximate distance function from the location (x,y) to the polygon. A multilayer perceptron neural network is created using the following set of functions: 
+
+    def RePU3(x):
+        # Cubic ReLU: max(0, x**3) equals max(0, x)**3 because the cube is monotone
+        return jax.numpy.maximum(0, x**3)
+
+    def repu3_layer(params, x):
+        # One hidden layer: affine map followed by the RePU3 activation
+        return RePU3(jax.numpy.dot(params[0], x) + params[1])
+
+    def NN(params, x, y):
+        """ Compute the forward pass for each example individually """
+        activations = jax.numpy.array([x, y])
+        # Loop over the RePU3 hidden layers
+        for w, b in params[:-1]:
+            activations = repu3_layer([w, b], activations)
+        # Linear output layer
+        final_w, final_b = params[-1]
+        final = jax.numpy.sum(jax.numpy.dot(final_w, activations)) + final_b
+        return final[0]
+
+In the above, the cubic ReLU function is used as the activation in the hidden layers, which is also an instance of the Rectified Power Unit (RePU) activation function $\mathrm{RePU}_n(x) := [\max(0, x)]^n$ ($n = 3$ for cubic ReLU). A linear activation function is used in the output layer. Here, params is a NumPy array consisting of the optimizable parameters, (w,b), of the network, and its shape is determined by the architecture of the network. A network architecture that has been used frequently in this paper for problems in $\mathbb{R}^d$ is a two-hidden-layer network d–N–N–1, where N is the number of neurons in each hidden layer and is typically 50. Finally, the following functions and their derivatives are used to construct the ansatz $\tilde{u}^{bc}_{nn}(x; \theta)$, which satisfies homogeneous essential boundary conditions that are imposed on the boundary of a polygonal domain:
+
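+    from jax import grad
+
+    def u(params, x, y):
+        # Ansatz: distance function times network output, so u = 0 on the polygon boundary
+        return phi(x, y, segments) * NN(params, x, y)
+
+    # Examples of first-order partial derivatives (argument 1 is x, argument 2 is y)
+    gradx = grad(u, 1)
+    grady = grad(u, 2)
+
+    # Examples of second-order partial derivatives
+    gradxx = grad(grad(u, 1), 1)
+    gradyy = grad(grad(u, 2), 2)
+    gradxy = grad(grad(u, 1), 2)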
+
+Once the derivatives are formed, the appropriate loss functions are constructed in the interior of the domain to solve the problem through Ritz or collocation. For all problems that are solved in this paper, network training is done using Google’s Colaboratory cloud platform [94]. Single-precision arithmetic is used in the computations.
+
+An important consideration is the generation of collocation points in the interior and on the boundary of the domain to evaluate the loss terms. When the domain is simple (for example, a square or a hypercube), these points are generated on a uniform grid in $\mathbb{R}^d$. This is the case for the solution of the heat equation on square domains, Eikonal problems in Section 10.5, and for the Poisson problem over the 4-dimensional hypercube in Section 11. In other examples that involve more complicated domains, we use the Python library dmsh [95], which draws inspiration from distmesh [96], to create triangular meshes and we use the centroids of the generated triangles as the interior collocation points. Figure 15 shows a few representative meshes generated by dmsh that are used in this paper. The corresponding interior collocation points are shown as dots. When needed, dmsh is also used to create collocation points on the boundaries of the domain.
+
+Figure 15: Representative meshes generated from dmsh and the corresponding collocation points.
+
+9. Numerical Examples in One Dimension 
+
+We consider several second-order problems that involve essential and mixed boundary conditions. As a prototypical fourth-order problem, we solve the deflection for a clamped Euler-Bernoulli beam, and also investigate a single hidden layer meshfree RBF-network solution in Section 9.1.7. Our objective is to demonstrate the benefits of the new formulation vis-à-vis standard PINN [6] in which equal weights are chosen for the PDE and boundary loss terms. Hence, our emphasis is not on obtaining the most accurate PINN solution by finding the optimal hyperparameters, nor on comparing its performance with the finite element method for forward problems. We compare our results with those obtained using standard PINN [6] and with the exact solution.
+
+9.1. Deformation of a homogeneous elastic rod 
+
+Consider the boundary-value problem for the deformation of an elastic rod (Young’s modulus and cross-sectional area are taken as unity):
+
+$$u'' + b = 0 \ \ \text{in} \ \Omega = (x_1, x_2), \tag{51a}$$
+
+$$u(x_1) = g, \qquad u'(x_2) + c\,u(x_2) = h, \tag{51b}$$
+
+where $u := u(x)$ is the axial displacement field, $u'(x)$ is the strain field, $b := b(x)$ is the axial body force per unit length, and $c$, $g$ and $h$ are constants. The second boundary condition is a Robin boundary condition; if the bar is connected to a spring that is attached to a fixed end, then $h = 0$. We select test problems with different boundary conditions, and vary the regularity of $b(x)$ from it being a smooth function to a δ-function, and even choose $b(x)$ that has a singularity at the origin.
+
+9.1.1. Example 1 
+
+As the first example, a Dirichlet problem in $\Omega = (0, 1)$ is selected with body force $b(x) = 1 - 2x + 10x^2$, and boundary conditions $u(0) = 1/2$ and $u(1) = -1/2$. The exact solution is: $u(x) = 1/2 - x^2/2 + x^3/3 - 10x^4/12$. The exact signed distance functions to $x = 0$ and $x = 1$ are $\phi_1(x) = x$ and $\phi_2(x) = 1 - x$, respectively. Now, we join these to form a smooth approximate distance function to the boundary $\partial\Omega = \{0, 1\}$. On using the product of $\phi_1$ and $\phi_2$, and the R-conjunction ($\alpha = 0$) and R-equivalence ($m = 2$) relations in (2) and (9), respectively, we obtain the following combined distance functions:
+
+$$\phi_A(x) = \phi_1(x)\phi_2(x), \qquad \phi_B(x) = \phi_1(x) + \phi_2(x) - \sqrt{\phi_1^2(x) + \phi_2^2(x)}, \qquad \phi_C(x) = \frac{\phi_1(x)\phi_2(x)}{\sqrt{\phi_1^2(x) + \phi_2^2(x)}}. \tag{52}$$
+
+Note that in one dimension, the product formula is normalized to order 1, but this does not generalize to higher dimensions. Coincidentally, for any domain $\Omega = (x_1, x_2)$ in one dimension, the product formula scaled by $L := x_2 - x_1$ coincides with $m = 1$ in the R-equivalence relation. In the numerical computations, the trial function is formed using (22):
+
+$$\tilde{u}^{bc}_{nn}(x; \theta) = g(x) + \phi(x)\, \tilde{u}^{R}_{nn}(x; \theta), \qquad g(x) = \frac{1 - 2x}{2}, \tag{53}$$
+
+where $g(x)$ is formed using the transfinite interpolant in (26), $\tilde{u}^{R}_{nn}(x; \theta)$ is the neural network approximation, and $\phi(x)$ is chosen to be either $\phi_A(x)$, $\phi_B(x)$ or $\phi_C(x)$. Note that we did not solve the patch test ($b = 0$) since the exact solution, $u(x) = g(x)$, is already captured by the presence of $g(x)$ in (53). The network architecture 1–30–30–1 is used. We compute collocation solutions using $\phi_A$, $\phi_B$ and $\phi_C$ in (53), and also for the standard PINN approximation, $\tilde{u}_{nn}(x; \theta)$ [6]. In Fig. 16, the numerical results are presented. All approaches are able to reach losses of the same order, and there is a good match between $\tilde{u}^{bc}_{nn}(x)$ and the exact displacement field. From Fig. 16d, we observe that the errors in the displacement and strain fields using $\tilde{u}^{bc}_{nn}$ are uniformly smaller than those obtained using $\tilde{u}_{nn}$. This difference stems from the fact that in our approach the trial function is constructed with the exact satisfaction of the boundary conditions.
+
+Figure 16: Collocation solutions for a Dirichlet problem. The body force $b(x) = 1 - 2x + 10x^2$, and the essential boundary conditions are $u(0) = 1/2$ and $u(1) = -1/2$. The network architecture is 1–30–30–1. Numerical solutions, $\tilde{u}^{bc}_{nn}(x)$, are computed using different ADFs ($\phi_A$, $\phi_B$, $\phi_C$), and are compared to the solution obtained using standard PINN, $\tilde{u}_{nn}(x)$. (a), (b) Training loss and normalized absolute error in the displacement field as a function of epochs. (c) Exact solution $u(x)$ and $\tilde{u}^{bc}_{nn}(x)$ (using $\phi_B$) are shown, and (d) Errors in the displacement and strain fields are compared.
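+
+The remark that the product formula scaled by $L$ coincides with the $m = 1$ R-equivalence relation can be checked in one line. The derivation below is a sketch that assumes the R-equivalence composition of two distance functions has the form $\phi_1\phi_2 / (\phi_1^m + \phi_2^m)^{1/m}$, which is the form that reproduces $\phi_C$ in (52) for $m = 2$. For $\Omega = (x_1, x_2)$ with $\phi_1(x) = x - x_1$ and $\phi_2(x) = x_2 - x$:
+
+$$\left. \frac{\phi_1 \phi_2}{\left( \phi_1^m + \phi_2^m \right)^{1/m}} \right|_{m=1} = \frac{(x - x_1)(x_2 - x)}{(x - x_1) + (x_2 - x)} = \frac{\phi_1(x)\,\phi_2(x)}{L}, \qquad L := x_2 - x_1.$$
+
+The sum $\phi_1 + \phi_2$ is the constant $L$ only in one dimension, which is why the coincidence does not carry over to higher dimensions. For $\Omega = (0, 1)$, $L = 1$ and $\phi_A$ in (52) is itself the $m = 1$ composition.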
+
+We emphasize that it is possible for both standard PINN and our formulation to deliver better accuracy if a larger network, more interior collocation points, and network training for a longer duration are chosen. This is realized at the expense of more computing time. We demonstrate this for the present example only. We now use a 1–50–50–1 network architecture with 300 interior collocation points. The training is conducted until 50,000 epochs. The solutions are presented in Fig. 17, which reveals that both approaches are now much more accurate than the results that are shown in Fig. 16. From Fig. 17, we observe that the solutions obtained using standard PINN and our approach have relative errors of $O(10^{-5})$ and $O(10^{-6})$, respectively. So even at smaller relative errors, we note that our approach is more accurate than standard PINN for the same values of the hyperparameters.
+
+9.1.2. Example 2 
+
+We reconsider the problem posed in Example 1, with Dirichlet boundary condition at $x = 0$ but homogeneous Neumann (traction-free) condition at $x = 1$. For the domain $\Omega = (0, 1)$, body force $b(x) = 1 - 2x + 10x^2$, and boundary conditions $u(0) = 1/2$ and $u'(1) = 0$, the exact solution is: $u(x) = 1/2 + 10x/3 - x^2/2 + x^3/3 - 10x^4/12$. The trial function, $\tilde{u}^{bc}_{nn}(x; \theta)$, is formed using (33), with $\phi_1(x) = x$, $\phi_2(x) = 1 - x$, $\phi(x) = x(1 - x)$, $g = 1/2$, and $c = h = 0$:
+
+$$\tilde{u}^{bc}_{nn}(x; \theta) = x\, \tilde{u}^{R}_{nn}(x; \theta) + x(1 - x) \left[ (1 - x)\, \tilde{u}^{R}_{nn}(x; \theta) + \left\{ x\, \tilde{u}^{R}_{nn}(x; \theta) \right\}' \Big|_{x=1} \right] + \frac{1}{2}. \tag{54}$$
+
+Figure 17: Collocation solutions for the Dirichlet problem that is presented in Example 1. The network architecture is 1–50–50–1 with 300 interior collocation points. The captions for (a), (b), (c), (d) mirror those shown in Fig. 16.
+
+The network architecture 1–50–50–1 is used, since the network 1–30–30–1 did not converge for standard PINN. The collocation solutions $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ are compared to the exact solution $u(x)$ in Fig. 18. We observe from Fig. 18a that the training loss for $\tilde{u}^{bc}_{nn}(x; \theta)$ is about two orders smaller than that for $\tilde{u}_{nn}(x; \theta)$, whereas in Fig. 18b, the corresponding normalized absolute error is one order smaller. The numerical solution $\tilde{u}^{bc}_{nn}(x)$ is in excellent agreement with the exact solution in Fig. 18c. Over the entire interval $x \in [0, 1]$, we find that the displacement and strain fields from $\tilde{u}^{bc}_{nn}(x)$ are markedly more accurate than those from $\tilde{u}_{nn}(x)$ (see Fig. 18d).
+
+Figure 18: Collocation solutions for a Neumann problem. The body force $b(x) = 1 - 2x + 10x^2$, and $u(0) = 1/2$ and $u'(1) = 0$. The network architecture is 1–50–50–1. Numerical solutions using $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$ are compared. (a), (b) Training loss and normalized absolute error as a function of epochs. (c) $\tilde{u}^{bc}_{nn}(x)$ is compared to the exact solution. (d) Errors in the displacement and strain fields for $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$.
+
+9.1.3. Example 3 
+
+For this example, we choose $\Omega = (-1, 1)$, a sinusoidal body force $b(x) = -\sin(k\pi x)$ with varying $k$, and essential boundary conditions $u(-1) = u(1) = 0$. The exact solution is: $u(x) = -\sin(k\pi x)/(\pi^2 k^2)$. This example serves to reveal the spectral (low-frequency) bias [97, 98] of neural network approximations. In Fig. 19, the numerical results are presented. For $k = 1$ and using standard PINN, we found that the normalized absolute error for the networks 1–30–30–1 and even 1–100–100–1 did not converge; it took a network architecture of 1–100–100–100–1 for the error to be comparable to that obtained using $\tilde{u}^{bc}_{nn}(x; \theta)$ on a 1–30–30–1 architecture. The normalized absolute error as a function of epochs is shown in Fig. 19a. The numerical solutions are compared to the exact solution $u(x)$ in Fig. 19b, and we notice the poor solution that is generated by $\tilde{u}_{nn}(x; \theta)$ on the 1–30–30–1 network. For $k = 3, 5$ on a 1–100–100–1 network, the normalized absolute error in the $\tilde{u}^{bc}_{nn}(x)$ displacement field is shown in Fig. 19c, and $\tilde{u}^{bc}_{nn}(x)$ and $u(x)$ are compared in Fig. 19d for $k = 5$. Good agreement between $\tilde{u}^{bc}_{nn}(x)$ and $u(x)$ is realized.
+
+We attribute the poor performance of the standard PINN approach to the fact that the boundary conditions are poorly approximated, whereas with $\tilde{u}^{bc}_{nn}(x; \theta)$ the boundary conditions are exactly satisfied. This observation is in broad agreement with the findings of Wang et al. [16], who further analyze the source of this discrepancy by drawing attention to the contributions from the boundary and interior terms in the loss function. For $k \geq 10$ with $\tilde{u}^{bc}_{nn}(x; \theta)$, obtaining accurate PINN solutions using a single neural network for the entire domain becomes infeasible due to the spectral bias of the neural network approximation; one can adopt a domain-decomposition strategy to obtain accurate numerical solutions for high-frequency problems.
+
+Figure 19: Collocation solutions for a homogeneous Dirichlet problem with a sinusoidal body force $b(x) = -\sin(k\pi x)$ with varying $k$. (a) Normalized absolute error as a function of epochs for $k = 1$. $\tilde{u}^{bc}_{nn}(x; \theta)$ on a 1–30–30–1 architecture is compared to $\tilde{u}_{nn}(x; \theta)$. $\tilde{u}_{nn}$–1, $\tilde{u}_{nn}$–2 and $\tilde{u}_{nn}$–3 are solutions that are obtained on network architectures 1–30–30–1, 1–100–100–1 and 1–100–100–100–1, respectively. (b) $\tilde{u}_{nn}(x)$, $\tilde{u}^{bc}_{nn}(x)$, and $u(x)$ are plotted for $k = 1$. (c) Normalized absolute error during training using $\tilde{u}^{bc}_{nn}(x; \theta)$ for $k = 3$ and $k = 5$. Network architecture is 1–100–100–1. (d) For $k = 5$, error in the displacement field for $\tilde{u}_{nn}(x)$.
+
+9.1.4. Example 4 
+
+Consider an elastic rod that occupies $\Omega = (-1, 1)$ and is subjected to a discontinuous body force $b(x) = H(x)$, where $H(x)$ is the Heaviside function. Essential boundary conditions are imposed at both ends: $u(-1) = 0$ and $u(1) = -1/2$. The exact solution, $u(x) \in C^1(\Omega)$, is:
+
+$$u(x) = \begin{cases} 0, & -1 \leq x < 0 \\ -\dfrac{x^2}{2}, & 0 \leq x \leq 1 \end{cases}. \tag{55}$$
+
+Numerical computations are performed on a 1–50–50–1 network. Numerical results for $\tilde{u}_{nn}$ and $\tilde{u}^{bc}_{nn}$ are presented in Fig. 20. The training loss of $\tilde{u}^{bc}_{nn}(x; \theta)$ at 10,000 epochs is $O(10^{-3})$ but that of $\tilde{u}_{nn}(x; \theta)$ stagnates at a value just below 0.1. These losses correspond to normalized absolute errors of $O(10^{-5})$ and $O(10^{-4})$ at the end of the training for $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$, respectively (see Fig. 20b). Figures 20c and 20d reveal that $\tilde{u}^{bc}_{nn}(x)$ is in good agreement with $u(x)$, whereas the errors in $\tilde{u}_{nn}(x)$ are significant, and are especially pronounced in the vicinity of $x = 0$. The PDE loss is dominant over the boundary losses. We point out that if both boundary conditions are homogeneous then the accuracy of $\tilde{u}_{nn}(x)$ is comparable to that of $\tilde{u}^{bc}_{nn}(x)$.
+
+Figure 20: Collocation solutions for a Dirichlet problem with a discontinuous body force, $b(x) = H(x)$. Essential boundary conditions are $u(-1) = 0$ and $u(1) = -1/2$, and the network architecture is 1–50–50–1. (a), (b) Training loss and normalized absolute error as a function of epochs for $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$. (c) Comparisons of $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ with the exact solution. (d) Error in the displacement and strain fields.
+
+9.1.5. Example 5 
+
+Let us consider an elastic rod that occupies $\Omega = (-1, 1)$ and is subjected to a unit point load at the origin, i.e., $b(x) = \delta(x)$, where $\delta(x)$ is the δ-function. A homogeneous essential boundary condition is prescribed at $x = -1$ and traction-free conditions prevail at $x = 1$. The exact solution, $u(x) \in C^0(\Omega)$, is:
+
+$$u(x) = \begin{cases} 1 + x, & -1 \leq x < 0 \\ 1, & 0 \leq x \leq 1 \end{cases}, \tag{56}$$
+
+which has a kink at $x = 0$. We use the deep Ritz method with $\tilde{u}^{bc}_{nn}(x; \theta)$ to solve this problem since it is not possible to solve this problem using the collocation method ($\delta(x)$ is a distribution that is supported on a set of zero measure). The network architecture is 1–50–50–1. Numerical results for $\tilde{u}^{bc}_{nn}$ are presented in Fig. 21. From Fig. 21a, we observe that the training loss converges to a value close to −0.5 in a few epochs; this corresponds to a small normalized absolute error of $O(10^{-3})$ (see Fig. 21b). On using $u$ given in (56), we find that the potential energy of the exact solution is:
+
+$$\Pi[u] = \frac{1}{2} \int_{-1}^{1} (u')^2 \, dx - u(0) = -\frac{1}{2},$$
+
+which is the target loss that $\Pi[\tilde{u}^{bc}_{nn}]$ seeks to attain. Figure 21c shows excellent agreement between $\tilde{u}^{bc}_{nn}(x)$ and the exact solution. The errors in the displacement and strain fields are presented in Fig. 21d. The errors in $u$ are uniformly small for all $x \in [-1, 1]$. The errors in the strain field follow the same trends, but are larger in the vicinity of the origin. This is not surprising since $\tilde{u}^{bc}_{nn}(x)$ is $C^2(\Omega)$ (cubic ReLU activation function), whereas the exact solution $u \in C^0(\Omega)$, with $u'(x)$ being discontinuous at $x = 0$. This is also the source for the small discrepancy in the Ritz energy loss.
+
+Figure 21: Ritz solution for a point load, $b(x) = \delta(x)$. Homogeneous essential boundary condition is imposed at $x = -1$ and traction-free conditions prevail at $x = 1$. The network architecture is 1–50–50–1. (a), (b) Training (Ritz) loss and normalized absolute error as a function of epochs. (c) Comparison of $\tilde{u}^{bc}_{nn}(x)$ with the exact solution. (d) Errors in displacement and strain fields.
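+
+The target value of the Ritz loss for this example follows directly from (56). As a worked check (not part of the original derivation): the strain of the exact solution is $u' = 1$ on $(-1, 0)$ and $u' = 0$ on $(0, 1)$, and the point load contributes $\int_{-1}^{1} \delta(x)\, u \, dx = u(0) = 1$, so
+
+$$\Pi[u] = \frac{1}{2} \left[ \int_{-1}^{0} 1^2 \, dx + \int_{0}^{1} 0^2 \, dx \right] - u(0) = \frac{1}{2} - 1 = -\frac{1}{2}.$$
+
+Because the exact solution minimizes $\Pi$ over all admissible functions, an exactly integrated energy of any admissible trial function cannot fall below $-1/2$; a Monte Carlo estimate of the energy can deviate slightly on either side.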
+
+9.1.6. Example 6 
+
+To obtain a weakly singular solution for the elastic rod problem, we consider a body force that has a singularity at the origin. We choose $\Omega = (0, 1)$ with $b(x) = 2x^{-4/3}/9$ with essential boundary conditions $u(0) = 0$ and $u(1) = 1$. The exact solution is $u(x) = x^{2/3}$, and $u \in H^1(\Omega)$ is weakly singular. This problem is solved using the collocation approach on a 1–50–50–1 network architecture. Numerical results for $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ are presented in Fig. 22. The training loss of $\tilde{u}^{bc}_{nn}(x; \theta)$ at 10,000 epochs is close to $O(10^{-4})$ and that of $\tilde{u}_{nn}(x; \theta)$ is more than two orders larger. This same trend is observed in the normalized absolute error as a function of epochs (see Fig. 22b). Figures 22c and 22d show that $\tilde{u}^{bc}_{nn}(x)$ is in fairly good agreement with $u(x)$, whereas the errors in the displacement and strain fields of $\tilde{u}_{nn}(x)$ are appreciable. It appears that $\tilde{u}_{nn}(x)$ and $u(x)$ differ by close to an affine function, which one can infer as being present within $\tilde{u}_{nn}(x)$ to meet the essential boundary conditions.
+
+9.1.7. Example 7 
+
+To draw connections to meshfree methods based on RBFs [46] and local maximum-entropy approximants [48], which are discussed in Section 1, we solve (51) with inhomogeneous Dirichlet boundary conditions. The domain $\Omega = (0, 1)$, and we choose the exact solution as:
+
+$$u(x) = \sum_{i=1}^{2} \exp\left[ -\gamma_i (x - a_i)^2 \right], \tag{57}$$
+
+which is a sum of two Gaussian functions, and $\gamma_i$ and $a_i$ ($i = 1, 2$) are constants. The body force $b(x) = -u''(x)$. Essential boundary conditions are imposed at $x = 0$ and $x = 1$ that are consistent with the exact solution in (57). In the numerical computations, we choose $a_1 = 1/4$, $a_2 = 6/10$, $\gamma_1 = 9$ and $\gamma_2 = 10$. The network architecture is 1–10–1 (1 hidden layer). For the hidden layer, we select a Gaussian activation function, $\sigma(x) = \exp(-x^2)$, and a linear activation function for the output layer. The centers of the Gaussians are chosen to be fixed, and only the support-widths of the Gaussians and the weights in the output layer are the unknown parameters in the network. The centers for the neurons are chosen as $b = \{0, 1/9, 2/9, \ldots, 1\}$.
+
+Figure 22: Collocation solutions for a Dirichlet problem with a singular body force, $b(x) = 2x^{-4/3}/9$. Essential boundary conditions are $u(0) = 0$ and $u(1) = 1$, and the network architecture is 1–50–50–1. (a), (b) Training loss and normalized absolute error as a function of epochs for $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$. (c) Comparisons of $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ with the exact solution. (d) Error in the displacement and strain fields.
+
+In Fig. 23, the numerical results for $\tilde{u}^{bc}_{nn}(x; \theta)$ using collocation and Ritz are presented. By the end of the training, the losses for the collocation and Ritz solutions are $O(10^{-1})$ or better. Note that the loss measures are distinct for the collocation and Ritz solutions; the PDE loss is shown for the former whereas it is the (Ritz) energy loss that is presented for the latter. The collocation solution is in good agreement with the exact solution (Fig. 23b). From Fig. 23c, we observe that the errors in the displacement field for both methods are small, but the errors in the derivative (strain fields) are appreciable. With more neurons in the hidden layer, it is expected that the accuracy in the strain field will substantially improve. This example reveals the flexibility that the PINN affords in that variational adaptive solutions can be captured by a Gaussian neural network with a single hidden layer. Realizing this is much more difficult using meshfree basis functions, since the underlying Ritz formulation becomes a nonlinear, nonconvex minimization problem.
+
+Figure 23: Collocation and Ritz solutions using a Gaussian activation function for a Dirichlet problem with the exact solution as the sum of two Gaussians. Network architecture is 1–10–1. (a) Training loss as a function of epochs for $\tilde{u}^{bc}_{nn}(x; \theta)$ (collocation and Ritz methods). (b) Comparisons of $\tilde{u}^{bc}_{nn}(x)$ (collocation) with the exact solution. (c) Error in the displacement and strain fields for collocation and Ritz methods.
+
+9.2. Longitudinal vibrations of a homogeneous elastic rod 
+
+The eigenproblem for the longitudinal vibrations of an elastic bar that is fixed at both ends is: 
+
+$$u'' + \omega^2 u = 0 \ \ \text{in} \ \Omega = (0, 1), \tag{58a}$$
+
+$$u(0) = u(1) = 0. \tag{58b}$$
+
+The exact eigenfunctions are: $u_n(x) = \sin(\omega_n x)$, where $\omega_n = n\pi$ ($n \in \mathbb{N}$) are the natural frequencies. The eigenvalue $\lambda_n = \omega_n^2$ corresponds to the eigenfunction $u_n(x)$.
+
+We use the Ritz method to solve this problem using $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$. The Rayleigh quotient minimization problem for the smallest eigenvalue (lowest mode) is [9]:
+
+$$\min_{u \in S} \frac{\int_0^1 (u')^2 \, dx}{\int_0^1 u^2 \, dx}, \qquad \text{subject to} \ \int_0^1 u^2 \, dx = 1, \tag{59}$$
+
+where $S = \{u : u \in H^1(\Omega),\ u(0) = u(1) = 0\}$ and the normalization constraint on the eigenfunction appears in (59). For the trial function $\tilde{u}^{bc}_{nn}(x; \theta)$, the loss function is:
+
+$$\mathcal{L}^{bc}_{nn}(\theta) = \frac{\sum_{k=1}^{N_I} \left( (\tilde{u}^{bc}_{nn})'(x_k; \theta) \right)^2}{\sum_{k=1}^{N_I} \left( \tilde{u}^{bc}_{nn}(x_k; \theta) \right)^2} + \left[ \frac{1}{N_I} \sum_{k=1}^{N_I} \left( \tilde{u}^{bc}_{nn}(x_k; \theta) \right)^2 - 1 \right]^2, \tag{60}$$
+
+where $N_I$ is the number of interior integration points. Note that apart from the PDE loss term, we have an additional loss term due to the normalization constraint in (59). The loss function for $\tilde{u}_{nn}(x; \theta)$ consists of the two contributions that appear in (60), and in addition it includes two boundary loss terms to impose the essential boundary conditions. The network architecture 1–50–50–50–1 is used. In Fig. 24, the Ritz solutions for $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ are presented. The loss functions for $\tilde{u}^{bc}_{nn}$ and $\tilde{u}_{nn}$ saturate to values of 1 and 10, respectively. The errors in the natural frequency for $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$ are $O(10^{-4})$ and $O(1)$ at 10,000 epochs (see Fig. 24b). In Fig. 24c, we compare the lowest mode (eigenfunction) from the numerical solutions to the exact solution: $\tilde{u}^{bc}_{nn}(x)$ is in much better agreement with the exact solution than $\tilde{u}_{nn}(x)$. The exact mode shape is well-captured by $\tilde{u}^{bc}_{nn}(x)$ but it has not been exactly normalized, which leads to the observed discrepancy in the maximum amplitude.
+
+Figure 24: Ritz solutions for the longitudinal vibrations (lowest mode) of a homogeneous elastic rod. The network architecture is 1–50–50–50–1. (a), (b) Training loss and error in lowest natural frequency as a function of epochs for $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$. (c) Comparisons of $\tilde{u}^{bc}_{nn}(x)$ and $\tilde{u}_{nn}(x)$ for the lowest mode (eigenfunction) that corresponds to the lowest natural frequency $\omega_1 = \pi$.
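+
+The loss in (60) can be evaluated for any candidate function, not only for a network. The sketch below is an illustration under stated assumptions rather than the paper's code: it uses the Python standard library only, supplies the derivative analytically instead of through automatic differentiation, and places the $N_I$ integration points at cell midpoints. The names `rayleigh_loss`, `u` and `du` are hypothetical. For the normalized exact lowest mode $\sqrt{2}\sin(\pi x)$, the quotient reduces to $\omega_1^2 = \pi^2$ and the normalization penalty vanishes.
+
+```python
+import math
+
+def rayleigh_loss(u, du, n_points=200):
+    """Evaluate the loss in (60) at the midpoints x_k = (k + 1/2) / N_I of (0, 1)."""
+    xs = [(k + 0.5) / n_points for k in range(n_points)]
+    sum_du2 = sum(du(x) ** 2 for x in xs)
+    sum_u2 = sum(u(x) ** 2 for x in xs)
+    rayleigh = sum_du2 / sum_u2                # discrete Rayleigh quotient
+    penalty = (sum_u2 / n_points - 1.0) ** 2   # normalization constraint in (59)
+    return rayleigh + penalty
+
+amp = math.sqrt(2.0)  # amplitude that gives unit L2 norm on (0, 1)
+
+def u(x):
+    return amp * math.sin(math.pi * x)
+
+def du(x):
+    return amp * math.pi * math.cos(math.pi * x)
+
+loss = rayleigh_loss(u, du)
+omega = math.sqrt(loss)  # estimate of the lowest natural frequency
+print(loss, omega)
+```
+
+A candidate with the correct shape but a different amplitude leaves the quotient unchanged and is penalized only through the second term, which is the mechanism behind an amplitude mismatch when the constraint is not met exactly.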
+
+9.3. Advection-diffusion problem
+
+We consider the boundary-value problem:
+
+$$u'' = \alpha u' \ \ \text{in} \ \Omega = (0, 1), \tag{61a}$$
+
+$$u(0) = 0, \qquad u(1) = 1, \tag{61b}$$
+
+where $\alpha$ is the Peclet number, which measures the ratio of the advective rate to the diffusion rate. The exact solution of the problem posed in (61) is:
+
+$$u(x) = \frac{e^{\alpha x} - 1}{e^{\alpha} - 1}. \tag{62}$$
+
+We choose α = 0, 5, 10, 50 in this study (pure diffusion for α = 0 to strongly advective flow for α = 50) and run collocation simulations using $\tilde{u}^{bc}_{nn}(x; \theta)$. For α = 1, 5, 10, the network architecture 1–50–50–1 is used, and for α = 50, the architecture is 1–50–50–50–1. In Fig. 25, the simulation results are presented. For all cases shown in Fig. 25b, we observe that the normalized absolute error during the training is $O(10^{-3})$ or less. For all α that are selected, Fig. 25c shows an excellent match between $\tilde{u}^{bc}_{nn}(x)$ and the exact solutions.
+
+Figure 25: Collocation solutions using $\tilde{u}^{bc}_{nn}(x; \theta)$ for the advection-diffusion problem. For α = 1, 5, 10, the network architecture 1–50–50–1 is used, and for α = 50, the architecture is 1–50–50–50–1. (a), (b) Training loss and normalized absolute error as a function of epochs. (c) Comparisons of $\tilde{u}^{bc}_{nn}(x)$ to the exact solutions for different α. The solid lines are for the exact solutions and the markers (colors are consistent with those shown in (b)) represent the numerical solutions.
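+
+When (62) is used as a reference solution, the formula is 0/0 at α = 0 and loses precision for small α if evaluated naively. The following is a minimal sketch, assuming the standard library functions `math.expm1` and `math.log1p` and treating α = 0 through its limit $u(x) = x$. The helper `boundary_layer_width` is a hypothetical diagnostic that is not taken from the paper; it reports how close to $x = 1$ the solution is still below a given level, which indicates how sharply the solution rises as α grows.
+
+```python
+import math
+
+def advection_diffusion_exact(x, alpha):
+    """Exact solution (62): u(x) = (exp(alpha x) - 1) / (exp(alpha) - 1)."""
+    if alpha == 0.0:
+        return x  # limit of (62) as alpha -> 0 (pure diffusion)
+    return math.expm1(alpha * x) / math.expm1(alpha)
+
+def boundary_layer_width(alpha, level=0.01):
+    """Distance from x = 1 to the point where u equals `level` (alpha > 0)."""
+    # Solve expm1(alpha * x) = level * expm1(alpha) for x.
+    x = math.log1p(level * math.expm1(alpha)) / alpha
+    return 1.0 - x
+
+for alpha in (5.0, 10.0, 50.0):
+    print(alpha, advection_diffusion_exact(0.5, alpha), boundary_layer_width(alpha))
+```
+
+The width shrinks as α increases, so a uniform set of collocation points places fewer points inside the region where the solution varies.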
+
+9.4. Euler-Bernoulli beam bending 
+
+Consider the boundary-value problem for the deflection of a cantilever (Euler-Bernoulli) beam of unit length that is clamped at both ends and is subjected to a distributed load q(x): 
+
+$$EI\, v'''' = q \quad \text{in } \Omega = (0, 1), \tag{63a}$$
+
+$$v(0) = v'(0) = v(1) = v'(1) = 0, \tag{63b}$$
+
+where $EI$ is the flexural rigidity of the beam. We apply a point moment (clockwise orientation) of magnitude $M_0$ at $x = 1/2$ so that $q(x) = M_0 \delta'(x - 1/2)$. For this $q(x)$, the variational principle that is associated with the strong form in (63) is:
+
+$$\min_{v \in S} \left[ \Pi[v] = \frac{1}{2} \int_0^1 EI\, (v'')^2 \, dx + M_0\, v'(1/2) \right], \qquad S = \left\{ v : v \in H^1(0, 1), \ v(0) = v'(0) = v(1) = v'(1) = 0 \right\}. \tag{64}$$
+
+For this problem, all homogeneous boundary conditions associated with $v$ and $v'$ that appear in (63b) are essential boundary conditions, and hence a kinematically admissible PINN trial function is given by (50):
+
+$$\tilde{v}^{bc}_{nn}(x; \theta) = [\phi(x)]^2 \, \tilde{v}^{R}_{nn}(x; \theta), \tag{65}$$
+
+where $\phi(x)$ is an ADF (normalized to order 1) that vanishes at $x = 0$ and $x = 1$, and its normal derivative has unit magnitude on the boundary. For the computations, we choose $EI = 1$ and $M_0 = 1$. The exact solution of (63) is:
+
+$$v(x) = \frac{[\mathrm{ReLU}(x - 1/2)]^2}{2} + \frac{x^2}{8} - \frac{x^3}{4}, \quad x \in \Omega, \tag{66}$$
+
+which is a $C^1(\Omega)$ (piecewise cubic) function with a moment discontinuity of unit magnitude at $x = 1/2$. Deep Ritz solutions $\tilde{u}^{bc}_{nn}(x; \theta)$ and $\tilde{u}_{nn}(x; \theta)$ are computed on the network architecture 1–50–50–1 using the cubic ReLU activation function, and the results are presented in Fig. 26. From Fig. 26a, we observe that $\tilde{u}^{bc}_{nn}(x; \theta)$ converges to a loss of $O(10^{-4})$ at 20,000 epochs but the loss for $\tilde{u}_{nn}(x; \theta)$ remains at about 0.1. Figure 26b shows that the numerical solution $\tilde{u}^{bc}_{nn}(x)$ is in excellent agreement with the exact solution. In Fig. 26c, the error fields are shown: the deflection and rotation fields using $\tilde{u}^{bc}_{nn}(x)$ are accurate, whereas both fields have large errors for $\tilde{u}_{nn}(x)$. We attribute the poor performance of $\tilde{u}_{nn}(x; \theta)$ to scaling issues between the interior and boundary terms in the loss function. If the ratio of the weights assigned to the boundary loss term and the interior loss terms is set to $10^3 : 1$, we find that the results improve but they are still worse than those obtained using $\tilde{u}^{bc}_{nn}(x)$. A weight ratio of $10^4$ is required for the two solutions to have comparable accuracy.
+
+> Figure 26: Deep Ritz solutions using $\tilde{u}^{bc}_{nn}(x;\theta)$ and $\tilde{u}_{nn}(x;\theta)$ for the Euler-Bernoulli beam problem. The network architecture 1–50–50–1 is used. (a) Normalized absolute error as a function of epochs. (b) Comparison of $\tilde{u}^{bc}_{nn}(x)$ with the exact solution. (c) Errors in the deflection and rotation fields.
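+
+A short worked check of (66) with $EI = M_0 = 1$, included as a reader's sketch rather than as part of the paper. Writing $H$ for the Heaviside step function (a symbol introduced here for illustration), differentiation gives
+
+$$v'(x) = \mathrm{ReLU}(x - 1/2) + \frac{x}{4} - \frac{3x^2}{4}, \qquad v''(x) = H(x - 1/2) + \frac{1}{4} - \frac{3x}{2}.$$
+
+The clamped conditions in (63b) then follow: $v(0) = v'(0) = 0$ by inspection, and
+
+$$v(1) = \frac{1}{8} + \frac{1}{8} - \frac{1}{4} = 0, \qquad v'(1) = \frac{1}{2} + \frac{1}{4} - \frac{3}{4} = 0.$$
+
+Away from $x = 1/2$, $v''$ is linear in $x$, so $v'''' = 0$ there, and $v''$ jumps by exactly $1 = M_0 / EI$ at $x = 1/2$, which is the unit moment discontinuity mentioned above.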
+
+10. Numerical Examples in Two Dimensions 
+
+The promise of PINN is intriguing for inverse and parametric (design) problems. But this rests on its accuracy, robustness and reliability in solving the forward problem, which is the emphasis in this contribution. To this end, we focus on the performance of the PINN formulation with exact imposition of boundary conditions versus the standard PINN [6] with equally-weighted loss terms. For the two-dimensional problems, we consider polygonal domains and also domains with curved boundaries. We consider four distinct types of problems: steady-state heat conduction; computation of harmonic (Laplace equation) coordinates [99], which is an instance of generalized barycentric coordinates [80]; clamped Kirchhoff plate bending (fourth-order PDE); and the Eikonal equation to compute the signed distance function to a boundary. For these problems, we present our solutions and compare them to either the exact solution (if available) or a reference finite element solution, and to deep collocation [6]. In addition, we identify key differential properties of the approximate distance function and bring to the fore the issue of exact satisfaction of boundary conditions and its implications for the training of the network and the accuracy that the PINN approximation delivers.
+
+Prior to presenting the numerical examples, it is instructive to understand the properties and behavior of approximate distance functions that are obtained by either R-functions with R-equivalence as presented in Section 3.2 or via mean value potential fields that are discussed in Section 4. Since these functions are used in the PINN ansatz $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ that is presented in Section 5, one must consider the regularity of these functions when used in a deep collocation or a deep Ritz method.
+
+10.1. Laplacian of approximate distance fields 
+
+For Poisson or Laplace boundary-value problems that involve the Laplace operator, we must understand the behavior of the Laplacian of the ADFs that stem from R-functions and mean value potentials. Let us consider the unit square, $\Omega = (0, 1)^2$. The boundary $\partial\Omega$ consists of four line segments. On using either (10) or (14), we can construct an approximate distance function to the boundary, $\phi(\mathbf{x})$, which is normalized to order 1. Let us refer to these functions as $\phi_R(\mathbf{x})$ (REQ) and $\phi_M(\mathbf{x})$ (MVP). In Fig. 27, $\phi(\mathbf{x})$ and its Laplacian over the unit square are presented. We observe that both $\phi_R$ and $\phi_M$ are zero on the entire boundary and monotonic (concave) inside the domain. This property of these functions on $\partial\Omega$ is used to impose essential boundary conditions, as described in Section 5.2. From Figs. 27c and 27d, we observe that the Laplacians, $\nabla^2\phi_R$ and $\nabla^2\phi_M$, blow up at the vertices of the square. In fact, it is known that both $\nabla^2\phi_R$ and $\nabla^2\phi_M$ are singular at the vertices of a polygon, and therefore very large in magnitude near any of its vertices. Therefore, in a collocation-based approach to solve the Poisson equation, which involves the Laplacian, the contributions to the total loss from regions near the vertices can be very large. This does not affect Ritz-based solutions of the Poisson equation since the highest derivative in the variational principle is of order 1, and both $\phi_R$ and $\phi_M$ and their first-order derivatives are well-behaved (bounded) over the entire domain.
+
+There are two possible remedies to address this issue. The first involves modifying the $\phi_i$ that are obtained from R-equivalence and the ADFs that stem from mean value potential fields, so that the corresponding Laplacians are bounded in the domain. The second is to consider collocation points inside the domain that are not very close to the vertices. We leave the first route as part of future work, and proceed in this paper with the second choice. For instance, we show in Section 10.2.2 that if all interior collocation points are located in $\Omega_\delta = [\delta, 1 - \delta]^2$ ($\delta = 0.01$), which is a subset of the unit square, then both methods perform well.
+
+Finally, we reemphasize that it is imperative that in most instances $\phi$ be smooth in the interior of a computational domain; otherwise, $\nabla^2\phi$ will blow up at an interior collocation point and then one cannot use a trial function that uses $\phi$ in a collocation-based PINN method. So in most instances in 2D or 3D, this precludes the use of exact distance functions in the ansatz, and hence approximate distance functions should be used. To show this, let us consider the exact signed distance to the boundary of the unit disk, $\phi(\mathbf{x}) = 1 - \sqrt{x^2 + y^2}$. We can write
+
+$$\frac{\partial^2 \phi}{\partial x^2} = \frac{x^2}{(x^2 + y^2)^{3/2}} - \frac{1}{\sqrt{x^2 + y^2}},$$
+
+and since the second term is unbounded at the origin, the Laplacian of $\phi$ blows up at the origin. There are exceptions when the exact distance function can be used. It is a suitable choice when the medial axis of a domain is not part of the computational domain. For example, when solving a boundary-value problem over an annulus in 2D (see the problem solved in Section 10.2.3) or a hollow cylinder in 3D, the exact distance function can be used since the origin (where the exact distance function has derivative discontinuities) lies outside the computational domain.
+
+> Figure 27: Computation of $\phi(\mathbf{x})$ and $\nabla^2\phi(\mathbf{x})$ over the unit square for ADFs constructed from R-equivalence and mean value potential fields. Panels: (a) $\phi_R(\mathbf{x})$, (b) $\phi_M(\mathbf{x})$, (c) $\nabla^2\phi_R(\mathbf{x})$, (d) $\nabla^2\phi_M(\mathbf{x})$.
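+
+The full Laplacian of the exact distance function for the unit disk follows by symmetry. The steps below are a reader's sketch, with $r = \sqrt{x^2 + y^2}$ introduced as shorthand:
+
+$$\frac{\partial^2 \phi}{\partial y^2} = \frac{y^2}{r^3} - \frac{1}{r}, \qquad \nabla^2 \phi = \frac{x^2 + y^2}{r^3} - \frac{2}{r} = -\frac{1}{r}.$$
+
+So $|\nabla^2\phi| = 1/r$ grows without bound as $r \to 0$, whereas on an annulus such as $1/4 \le r \le 1$ it stays between 1 and 4, which is consistent with the exception for domains that exclude the origin.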
+
+10.2. Steady-state heat conduction 
+
+Let us consider the following model problem for isotropic steady-state heat conduction: 
+
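+$$-\nabla^2 u = f \quad \text{in } \Omega \subset \mathbb{R}^2, \tag{67a}$$
+
+$$u = g \ \text{ on } \Gamma_u, \qquad \frac{\partial u}{\partial n} + c u = h \ \text{ on } \Gamma_n, \tag{67b}$$
+
+where $u(\mathbf{x})$ is the temperature field and $f(\mathbf{x})$ is the heat source. The boundary $\partial\Omega = \Gamma_u \cup \Gamma_n$ is partitioned into two parts, with $\Gamma_u \cap \Gamma_n = \emptyset$. The temperature field $g(\mathbf{x})$ is imposed on the essential boundary $\Gamma_u$, and boundary data $h(\mathbf{x})$ that is associated with a Robin boundary condition is prescribed on $\Gamma_n$ ($c$ is in general a spatially varying field).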
+
+10.2.1. Essential boundary conditions 
+
+As the first example, we consider the biunit square, $\Omega = (-1, 1)^2$, with $u = g = 0$ prescribed on $\partial\Omega$. If $k \in \mathbb{N}$ and $f(x, y) = \sin(k\pi x)\sin(k\pi y)$ is the forcing function, then the exact solution for this problem is:
+
+$$u(\mathbf{x}) = \frac{\sin(k\pi x)\sin(k\pi y)}{2k^2\pi^2}.$$
+
+In the numerical computations, we consider two distinct forms of trial functions in the neural network. The first ansatz is the standard PINN that is given by $\tilde{u}_{nn}(\mathbf{x}; \theta)$, which does not a priori satisfy the boundary condition. The second form consists of trial functions that are given by $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = \phi(\mathbf{x})\,\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$, where $\phi(\mathbf{x})$ is a function that is zero on $\partial\Omega$. This property of $\phi(\mathbf{x})$ ensures that $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ automatically satisfies the essential boundary conditions. While an obvious choice for $\phi(\mathbf{x})$ is $(1 - x^2)(1 - y^2)$, here we consider $\phi(\mathbf{x})$ that are constructed using R-functions with R-equivalence composition (see Section 3.2) and by mean value potential fields (see Section 4) as they readily generalize to more complex domains. When needed, we use the acronyms REQ and MVP to distinguish the numerical solutions, $\tilde{u}^{bc}_{nn}(\mathbf{x})$, which are obtained using these two methods. The plots of $\phi(\mathbf{x})$ using REQ and MVP over a square are shown in Figures 7b and 10a, and it can be observed that $\phi$ is zero on the boundary of the domain.
+
+For the collocation scheme, we randomly sample $N_I$ points in $\Omega$ and $N_B$ points on $\partial\Omega$. To solve the problem using standard PINN, we minimize the loss $\mathcal{L}_{nn}(\theta)$ given in (41), which is reproduced below (using $g = 0$):
+
+$$\mathcal{L}_{nn}(\theta) = \|\nabla^2 \tilde{u}_{nn}(\mathbf{x}; \theta) + f(\mathbf{x})\|^2_{\Omega, N_I} + \|\tilde{u}_{nn}(\mathbf{x}; \theta)\|^2_{\partial\Omega, N_B},$$
+
+where $\|\cdot\|_{\Omega, N_I}$ and $\|\cdot\|_{\partial\Omega, N_B}$ are defined in (40) and (41). Since $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ automatically satisfies the boundary condition, the parameters in this ansatz are found by minimizing the loss given in (40), which for this problem is:
+
+$$\mathcal{L}^{bc}_{nn}(\theta) = \|\nabla^2 \tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) + f(\mathbf{x})\|^2_{\Omega, N_I}.$$
+
+Figure 28 shows the training loss and normalized absolute error as functions of the training epochs for $k = 1, 2$ ($N_I = 5{,}000$, $N_B = 400$). We observe from Figs. 28a and 28c that the training loss for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ is either comparable to or less than the loss for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ over the same number of epochs. However, this does not translate into better prediction accuracy. Figures 28b and 28d show the normalized absolute errors for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (REQ and MVP) as well as $\tilde{u}_{nn}(\mathbf{x}; \theta)$. These plots reveal that both REQ- and MVP-based schemes deliver solutions that are an order of magnitude more accurate than $\tilde{u}_{nn}(\mathbf{x}; \theta)$.
+
+> Figure 28: Training loss and normalized absolute errors for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) and $\tilde{u}_{nn}(\mathbf{x};\theta)$ in the heat conduction problem with homogeneous Dirichlet boundary conditions and forcing function $\sin k\pi x$. (a), (b) $k = 1$, and (c), (d) $k = 2$. Panels (a) and (c) show the training loss and panels (b) and (d) show the error, each against epochs.
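+
+As a worked check (a reader's sketch, not from the paper), the exact solution stated for this example satisfies (67a) with the given forcing. Each second derivative of $\sin(k\pi x)\sin(k\pi y)$ contributes a factor $-k^2\pi^2$, so
+
+$$-\nabla^2 u = \frac{(k^2\pi^2 + k^2\pi^2)\,\sin(k\pi x)\sin(k\pi y)}{2k^2\pi^2} = \sin(k\pi x)\sin(k\pi y) = f(x, y),$$
+
+and $u = 0$ on $\partial\Omega$ because $\sin(\pm k\pi) = 0$ for integer $k$. The amplitudes differ markedly: $\max|u| = 1/(2k^2\pi^2) \approx 0.051$ for $k = 1$, while $\max|f| = 1$, so $u$ is roughly twenty times smaller in magnitude than its Laplacian.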
+
+It is interesting to note that $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ produces smaller normalized absolute errors during training than $\tilde{u}_{nn}(\mathbf{x}; \theta)$ even though it has larger losses. This is observed in almost all cases, and it deserves some comment. There is no reason to assume that the losses for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ and $\tilde{u}_{nn}(\mathbf{x}; \theta)$ should be comparable. For the problem under consideration, $\mathcal{L}^{bc}_{nn}(\theta)$ comprises terms that involve the derivatives of $\phi(\mathbf{x})$, and these terms are not present in $\mathcal{L}_{nn}(\theta)$. A further issue with $\tilde{u}_{nn}(\mathbf{x}; \theta)$ is the relative scaling of the losses in $\mathcal{L}_{nn}(\theta)$, which comprises a loss on $\nabla^2 \tilde{u}_{nn}(\mathbf{x}; \theta)$ and a loss on $\tilde{u}_{nn}(\mathbf{x}; \theta)$. In this problem, the norm of $u$ is much smaller than the norm of its Laplacian, and therefore the $\nabla^2 \tilde{u}_{nn}(\mathbf{x}; \theta)$ term dominates in $\mathcal{L}_{nn}(\theta)$. As a result, the optimizer can drive $\mathcal{L}_{nn}(\theta)$ to very small values without adequately addressing the boundary loss term. We can see that this is indeed the issue if we compare the prediction errors over the domain (see Fig. 29). The prediction errors for the REQ and MVP schemes are much smaller than those for $\tilde{u}_{nn}(\mathbf{x})$, and furthermore, the prediction errors for $\tilde{u}_{nn}(\mathbf{x})$ are large near the boundary of the domain. This suggests that $\tilde{u}_{nn}(\mathbf{x}; \theta)$ is undervaluing the boundary loss term in $\mathcal{L}_{nn}(\theta)$. In principle, it is possible to improve the results for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ by assigning a larger weight to the boundary loss term in $\mathcal{L}_{nn}(\theta)$ [16], but this is an ad hoc remedy, which is not needed in our approach since the boundary condition is exactly met. To see that this is indeed the case, one can take the loss function for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ to be a convex combination of the interior and boundary loss terms:
+
+$$\mathcal{L}_{nn}(\theta) = w \underbrace{\|\nabla^2 \tilde{u}_{nn}(\mathbf{x}; \theta) + f(\mathbf{x})\|^2_{\Omega, N_I}}_{\text{PDE loss}} + (1 - w) \underbrace{\|\tilde{u}_{nn}(\mathbf{x}; \theta)\|^2_{\partial\Omega, N_B}}_{\text{Boundary loss}},$$
+
+where $w \in [0, 1]$ is a scalar that can be used to tune the relative weights of the two losses. Figures 30a–30c show the evolution of the PDE loss and the boundary loss for different values of $w$ ($k = 1$). The value $w = 0.1$ weighs the boundary loss term 9 times more than the PDE loss term and it achieves smaller error than both $w = 0.5$ and $w = 0.9$ (Fig. 30d). This suggests that there likely is a sweet spot for $w$ that results in very low errors. However, the best value of $w$ likely depends on the problem under consideration and on the boundary conditions, and it may be impossible to determine in problems where the exact solution is unknown. A distinguishing attribute of our approach is that no such tuning of relative weights is needed.
+
+Figure 29: Prediction errors for $k = 1$ in Example 1. (a) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ), (b) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP), and (c) $\tilde{u}_{nn}(\mathbf{x})$. The color scales are of order $10^{-4}$ in (a) and (b) and of order $10^{-3}$ in (c).
+
+Figure 30: Loss function for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ in Example 1 is a convex combination of PDE loss and boundary loss terms ($k = 1$). Evolution of the PDE loss and boundary loss for (a) $w = 0.5$, (b) $w = 0.1$ and (c) $w = 0.9$. (d) Evolution of normalized absolute error during training for $w = 0.5, 0.1, 0.9$.
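+
+The relative weighting implied by each choice of $w$ in the convex combination is simple arithmetic, shown here as a reader's sketch. The ratio of the boundary-loss weight to the PDE-loss weight is
+
+$$\frac{1 - w}{w} = 9 \ \ (w = 0.1), \qquad \frac{1 - w}{w} = 1 \ \ (w = 0.5), \qquad \frac{1 - w}{w} = \frac{1}{9} \ \ (w = 0.9),$$
+
+so of the three settings only $w = 0.1$ emphasizes the boundary term, which matches the ordering of the errors reported for Fig. 30d.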
+
+10.2.2. Example 2 
+
+We consider the Laplace equation ($f = 0$) over the unit square, $\Omega = (0, 1)^2$, with boundary conditions
+
+$$u(\mathbf{x}) = 0 \ \text{ on } \Gamma_1, \Gamma_2, \Gamma_3, \qquad u(\mathbf{x}) = g_4(\mathbf{x}) = \sin \pi x \ \text{ on } \Gamma_4, \tag{68}$$
+
+where $\Gamma_1 = \{(x, y) : x = 0,\ 0 \le y \le 1\}$, $\Gamma_2 = \{(x, y) : 0 \le x \le 1,\ y = 0\}$, $\Gamma_3 = \{(x, y) : x = 1,\ 0 \le y \le 1\}$, and $\Gamma_4 = \{(x, y) : 0 \le x \le 1,\ y = 1\}$ are the boundary edges. The exact solution for this problem is:
+
+$$u(\mathbf{x}) = \frac{(e^{\pi y} - e^{-\pi y}) \sin \pi x}{e^{\pi} - e^{-\pi}}.$$
+
+We chose this problem to demonstrate how to exactly satisfy nonzero essential boundary conditions on different subsets of the boundary through the use of transfinite interpolation. To construct a trial solution that satisfies the boundary conditions, we first create a composite approximate distance function, $\phi(\mathbf{x})$, to $\Gamma = \Gamma_1 \cup \Gamma_2 \cup \Gamma_3 \cup \Gamma_4$. This ADF can either be formed by the joining operation via R-equivalence, or directly via (14) that uses mean value potential fields on polygons. The resultant $\phi(\mathbf{x})$ is similar to the $\phi(\mathbf{x})$ used in Example 1 (see Fig. 27). We combine the Dirichlet boundary data into one function $g(\mathbf{x})$ by using the transfinite interpolant in (26). In Fig. 31, the function $g(\mathbf{x})$ and its Laplacian are plotted over the unit square. We observe that $g(\mathbf{x})$ is zero on $\Gamma_\alpha$ ($\alpha = 1, 2, 3$), and it is equal to $\sin \pi x$ on $\Gamma_4$. Referring to (27), the trial function for PINN is:
+
+$$\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = g(\mathbf{x}) + \phi(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta).$$
+
+Since $g(\mathbf{x})$ satisfies the boundary conditions in (68) and $\phi = 0$ on $\Gamma$, $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ satisfies the Dirichlet boundary conditions on all edges of the boundary. From Fig. 31b, we observe that the Laplacian of $g(\mathbf{x})$ is singular at two of the four vertices on the boundary, and again this is handled by performing collocation over the smaller square $[0.01, 0.99]^2$.
+
+> Figure 31: Plots of (a) $g(\mathbf{x})$ and (b) $\nabla^2 g(\mathbf{x})$ over the unit square.
+
+We determine the parameters of this network by minimizing $\mathcal{L}^{bc}_{nn}(\theta)$ as described earlier (with $f$ set to zero). For standard PINN, the parameters of $\tilde{u}_{nn}(\mathbf{x}; \theta)$ are determined by minimizing the following loss function:
+
+$$\mathcal{L}_{nn}(\theta) = \|\nabla^2 \tilde{u}_{nn}(\mathbf{x}; \theta)\|^2_{\Omega, N_I} + \sum_{\alpha=1}^{4} \|\tilde{u}_{nn}(\mathbf{x}; \theta) - g_\alpha(\mathbf{x})\|^2_{\Gamma_\alpha, N_B^\alpha},$$
+
+where $N_B^\alpha$ is the number of collocation points on $\Gamma_\alpha$ ($\alpha = 1, 2, 3, 4$) and $g_\alpha(\mathbf{x}) = 0$ ($\alpha = 1, 2, 3$) in this problem. In the computations, we select a total of 400 boundary collocation points and 5,000 interior collocation points within $[0.01, 0.99]^2$.
+
+The numerical results are shown in Fig. 32, where the training loss, the normalized absolute error, and exact and approximate solutions are presented for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ and $\tilde{u}_{nn}(\mathbf{x}; \theta)$. As in the previous example, the training loss for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ is orders of magnitude smaller than for both REQ and MVP (Fig. 32a). The losses for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ start at relatively high values during the initial stages of the training, which is due to the large contributions from the Laplacian in the vicinity of the vertices. However, the network is quickly able to optimize and bring the losses down by almost two orders of magnitude in a couple of thousand training epochs. Still, the losses for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ remain several orders of magnitude larger than those for $\tilde{u}_{nn}(\mathbf{x}; \theta)$ at the end of the training. However, this example also reveals that the absolute value of the loss in itself is not very meaningful. The normalized absolute errors of the three schemes are presented in Fig. 32b as a function of the training epochs. It is evident from this plot that the errors in $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ are orders of magnitude smaller than the error in $\tilde{u}_{nn}(\mathbf{x}; \theta)$. The error achieved by $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (REQ) is almost an order of magnitude smaller than that of $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (MVP), and almost two orders of magnitude better than that of $\tilde{u}_{nn}(\mathbf{x}; \theta)$. Contour plots of the exact and approximate solutions appear in Figs. 32c–32f. The prediction errors of the three schemes are displayed in Fig. 33. It can be seen that the boundary errors in both REQ and MVP are precisely zero. This is expected, since $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ has been designed to satisfy the boundary conditions. On the other hand, $\tilde{u}_{nn}(\mathbf{x}; \theta)$ has large errors on the boundary of the domain.
+The errors from $\tilde{u}_{nn}(\mathbf{x}; \theta)$ are roughly an order of magnitude larger than MVP and two orders of magnitude larger than REQ. As in the previous example, one can improve the results of $\tilde{u}_{nn}(\mathbf{x}; \theta)$ by weighting the boundary loss more than the PDE loss in $\mathcal{L}_{nn}(\theta)$.
+
+Figure 32: Numerical results for the Laplace problem on the unit square with nonzero essential boundary conditions. (a), (b) Training loss and normalized absolute errors for $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (REQ and MVP) and $\tilde{u}_{nn}(\mathbf{x}; \theta)$. Contour plots over the unit square of the (c) exact solution, (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ), (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP), and (f) $\tilde{u}_{nn}(\mathbf{x})$.
+
+Figure 33: Surface plots of the errors in the numerical solutions for the Laplace problem with nonzero essential boundary condition. (a) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ), (b) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP), and (c) $\tilde{u}_{nn}(\mathbf{x})$.
+
+This problem can also be solved using a Ritz scheme, which is appealing since only first-order derivatives are required in the loss function. We form a trial function that satisfies the essential conditions using (27):
+
+$$\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = g(\mathbf{x}) + \phi(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta).$$
+
+The parameters of the network can now be found by minimizing the loss in (47). To numerically evaluate the integral, we divide the square into a uniform grid with $N_I$ interior points. Since the Ritz loss does not involve second derivatives of $\phi(\mathbf{x})$ or $g(\mathbf{x})$, all terms in the loss are well-defined and bounded even arbitrarily close to the boundaries of the domain. We select 5,000 interior points on the square $[0.0001, 0.9999]^2$. For a Dirichlet problem, it is especially important to sample close to the boundaries, because otherwise the loss may be trivially minimized by a $u(\mathbf{x})$ that is a constant. Sampling close to the boundaries informs the algorithm that a constant $u(\mathbf{x})$ leads to large errors near the boundaries, which are manifested in the loss term as large gradients. Numerical results for the Ritz method using $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ with REQ are presented in Fig. 34. The training loss and normalized absolute errors as a function of epochs are shown in Fig. 34a, which reveals that the error reduces to $O(10^{-2})$ in fewer than 2,000 epochs. In Fig. 34b, the prediction errors over the square are displayed. Compared to Fig. 33, we find that the errors in the Ritz scheme are smaller than MVP-based collocation but larger than REQ-based collocation.
+
+> Figure 34: Ritz solution using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ) for the Laplace problem in Example 2. (a) Evolution of training loss and normalized absolute error, and (b) Prediction errors over the domain after training.
+
+10.2.3. Curved domain 
+
+For a problem with a curved domain, we consider the Laplace equation, $\nabla^2 u = 0$, in an annulus that is bounded between circles of radii $R_1 = 1$ (boundary $\Gamma_1$) and $R_2 = 1/4$ (boundary $\Gamma_2$) [71]. The essential boundary conditions are:
+
+$$u = 1 \ \text{ on } \Gamma_1, \qquad u = 2 \ \text{ on } \Gamma_2.$$
+
+The exact solution for this problem is:
+
+$$u(\mathbf{x}) = 1 - \frac{\ln \sqrt{x^2 + y^2}}{\ln 4}. \tag{69}$$
+
+To impose the boundary conditions, we need a distance function to both $\Gamma_1$ and $\Gamma_2$, and also a composite boundary data function $g(\mathbf{x})$. In this case, on using the exact distance functions to the two circles, we have $\phi_1(\mathbf{x}) = 1 - \sqrt{x^2 + y^2}$ (positive in the interior of the larger disk), and $\phi_2(\mathbf{x}) = \sqrt{x^2 + y^2} - 1/4$ (positive outside the smaller disk). Since the origin is not part of the computational domain, here we can use the exact distance functions to form $\phi_1$ and $\phi_2$. Now, on combining these two ADFs using the R-equivalence operation ($m = 1$) in (9), we obtain $\phi(\mathbf{x}) = \phi_1 \sim \phi_2$ (positive in the annulus). Finally, we use the transfinite formula (26) to construct the composite boundary data as
+
+$$g(\mathbf{x}) = \frac{2\phi_1 + \phi_2}{\phi_1 + \phi_2} = \frac{7 - 4\sqrt{x^2 + y^2}}{3}.$$
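+
+The closed form for $g$ can be checked by substitution. This is a reader's sketch, with $r = \sqrt{x^2 + y^2}$ introduced as shorthand:
+
+$$2\phi_1 + \phi_2 = 2(1 - r) + \left(r - \tfrac{1}{4}\right) = \tfrac{7}{4} - r, \qquad \phi_1 + \phi_2 = \tfrac{3}{4},$$
+
+$$g = \frac{7/4 - r}{3/4} = \frac{7 - 4r}{3}, \qquad g\big|_{r = 1} = 1, \qquad g\big|_{r = 1/4} = 2,$$
+
+which matches the boundary data on $\Gamma_1$ and $\Gamma_2$. The exact solution (69) takes the same boundary values: $u = 1$ at $r = 1$ and $u = 1 - \ln(1/4)/\ln 4 = 2$ at $r = 1/4$.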
+
+In Fig. 35, $\phi(\mathbf{x})$ and $g(\mathbf{x})$ are plotted over the annulus. To clearly see that $\phi(\mathbf{x})$ is zero on $\Gamma_1$ and $\Gamma_2$, we show $-\phi(\mathbf{x})$ in Fig. 35a. From the plot in Fig. 35b, we observe that $g(\mathbf{x})$ matches the imposed boundary data on $\Gamma_1$ and $\Gamma_2$. An ansatz that exactly satisfies the boundary conditions is: $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta) = g(\mathbf{x}) + \phi(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x}; \theta)$.
+
+We compare the performance of $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ with a standard PINN trial function, $\tilde{u}_{nn}(\mathbf{x}; \theta)$, where the boundary conditions need to be enforced through the loss function. The loss functions for the two cases are similar to those discussed in previous examples, the only difference being that now we sample the boundary data at $N_B$ points on the curved boundary $\Gamma = \Gamma_1 \cup \Gamma_2$ and the interior collocation data at $N_I$ points in $\Omega_1 \cap \Omega_2$. We use dmsh [95] to triangulate the annulus and choose the centroids of the triangles as interior collocation points and the midpoints of the edges on the boundary as the boundary collocation points. For this problem, we pick $N_I = 612$ points in the interior of the domain, 66 points on $\Gamma_1$ and 30 points on $\Gamma_2$ for a total of $N_B = 96$ points on $\Gamma_1 \cup \Gamma_2$. A representative mesh that displays the interior collocation points over the annulus is shown in Fig. 15. We use the Adam optimizer with a learning rate (step size) of $10^{-3}$ for training both $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ and $\tilde{u}_{nn}(\mathbf{x}; \theta)$, and the training is stopped at 10,000 epochs for both networks in order to perform a fair comparison. In Fig. 36, the results of the training as well as the approximate solutions produced by $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (REQ) and $\tilde{u}_{nn}(\mathbf{x}; \theta)$ are presented. It is pertinent to mention here that $\tilde{u}_{nn}(\mathbf{x}; \theta)$ required a much larger network architecture (2–150–150–1) compared to $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$, which only required a 2–50–50–1 network in order to converge to acceptable results. However, even with a much larger network, as observable in Figs. 36a and 36b, the error in $\tilde{u}_{nn}(\mathbf{x}; \theta)$ by the end of the training is two orders of magnitude larger than that in $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$. It can be seen from Fig. 36e that $\tilde{u}_{nn}(\mathbf{x}; \theta)$ does not satisfy the boundary conditions. The approximation errors of $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ) and $\tilde{u}_{nn}(\mathbf{x})$ appear in Figs. 36f and 36g. The errors in $\tilde{u}_{nn}(\mathbf{x})$ are especially large on the boundaries of the domain.
+
+Figure 35: Approximate distance function to the boundaries of the annulus using R-equivalence composition of the exact distance functions to $\Gamma_1$ and $\Gamma_2$. Plots of (a) $-\phi(\mathbf{x})$ and (b) $g(\mathbf{x})$ that interpolate essential boundary data on $\Gamma_1$ and $\Gamma_2$.
+
+Figure 36: Numerical solutions for the Laplace problem on an annulus with Dirichlet boundary conditions. (a), (b) Training loss and normalized absolute errors of $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ (REQ) and $\tilde{u}_{nn}(\mathbf{x}; \theta)$. Surface plots of the (c) exact solution, (d) $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ using 2–50–50–1 network, and (e) $\tilde{u}_{nn}(\mathbf{x})$ using 2–150–150–1 network. Surface plots of the error for the numerical solutions (f) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ and (g) $\tilde{u}_{nn}(\mathbf{x})$.
+
+We now use the Laplace problem over the annulus to also demonstrate how to solve the problem using mixed boundary conditions. To this end, we retain the essential boundary condition on $\Gamma_1$ and convert the boundary condition on $\Gamma_2$ to a Robin boundary condition. For this problem, we find that $\partial u / \partial n = 4 / \ln 4$ on the inner boundary. We use the following mixed boundary conditions:
+
+$$u = 1 \ \text{ on } \Gamma_1, \qquad \frac{\partial u}{\partial n} + u = 2 + \frac{4}{\ln 4} =: h \ \text{ on } \Gamma_2. \tag{70}$$
+
+The exact solution remains unchanged and is given in (69). To create an ansatz for this problem with mixed boundary conditions, we follow the formulation in Section 5.3. We form $\phi_1$ and $\phi_2$, which remain unchanged from the previous case when essential boundary conditions are imposed on $\Gamma_1$ and $\Gamma_2$. Since $g = 1$, referring to (35), we can write
+
+$$u_1(\mathbf{x}) = g(\mathbf{x}) = 1, \qquad u_2(\mathbf{x}) = \left[1 + \phi_2 \left(1 + D_1^{\phi_2}\right)\right]\left(\tilde{u}^{R}_{nn}(\mathbf{x}; \theta)\right) - \phi_2 h, \tag{71}$$
+
+where $h$ is given in (70), and then on using transfinite interpolation given in (35a) and (35b), we form the trial function $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$. So we now have an ansatz that satisfies both the Dirichlet and Robin boundary conditions. The training loss and normalized absolute errors for the numerical solution using $\tilde{u}^{bc}_{nn}(\mathbf{x}; \theta)$ are presented in Figs. 37a and 37b. On comparing $\tilde{u}^{bc}_{nn}(\mathbf{x})$ in Fig. 37c to the exact solution shown in Fig. 36, we see that the boundary conditions have been exactly satisfied. The error plot in Fig. 37d reveals that the numerical solution is within 1 percent of the exact solution over the domain.
+
+> Figure 37: Numerical solution using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ) for the Laplace problem in an annulus with mixed boundary conditions. (a), (b) Training loss and normalized absolute errors as a function of epochs. Surface plots of (c) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ and (d) error in $\tilde{u}^{bc}_{nn}(\mathbf{x})$.
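+
+The value of $\partial u / \partial n$ used in (70) can be recovered from (69). This is a reader's sketch, assuming $n$ is the unit normal pointing out of the annulus (so toward the origin on $\Gamma_2$) and writing $r = \sqrt{x^2 + y^2}$ as shorthand:
+
+$$\frac{\partial u}{\partial r} = -\frac{1}{r \ln 4}, \qquad \left.\frac{\partial u}{\partial n}\right|_{\Gamma_2} = -\left.\frac{\partial u}{\partial r}\right|_{r = 1/4} = \frac{4}{\ln 4},$$
+
+$$h = \left(u + \frac{\partial u}{\partial n}\right)\bigg|_{\Gamma_2} = 2 + \frac{4}{\ln 4} \approx 4.885.$$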
+
+10.3. Generalized barycentric coordinates over polygons 
+
+Consider a planar polygon with $n$ vertices (nodal coordinates $\{\mathbf{x}_i\}_{i=1}^{n}$) that are in counterclockwise orientation. On an $n$-gon, harmonic coordinates [99] are one of the instances of generalized barycentric coordinates [79]. Each coordinate (shape function), $\varphi_i := \varphi_i(\mathbf{x})$, is associated with vertex $i$ and is obtained by solving the Laplace equation with piecewise affine Dirichlet boundary conditions. The boundary-value problem for harmonic coordinates is: find $\varphi_i \ge 0$ ($i = 1, 2, \ldots, n$) that solves
+
+$$\nabla^2 \varphi_i = 0 \quad \text{in } \Omega, \tag{72a}$$
+
+$$\varphi_i = g_i \quad \text{on } \partial\Omega, \tag{72b}$$
+
+where $g_i := g_i(\mathbf{x})$ is a piecewise affine (hat) function that is unity at $\mathbf{x}_i$ and is zero at all other vertices, i.e., $g_i(\mathbf{x}_j) = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. By virtue of the maximum principle for the Laplace equation, $\varphi_i > 0$ in the interior of the polygon.
+
+Here we will solve the harmonic coordinate problem on two representative polygons: a square and an L-shaped (nonconvex) polygon. For both examples, the network architecture 2–50–50–1 is used. In both examples, an important step is to assemble the boundary data into a function $g(\mathbf{x})$ through transfinite interpolation. Figure 38 shows the function $g(\mathbf{x})$ for specific choices of the vertex $i$. If the vertices of the square are numbered 1–2–3–4 (counterclockwise), starting at vertex 1 that is at $(0, 0)$, then the harmonic coordinate $u(\mathbf{x})$ must satisfy $u(0, 0) = 1$ and $u(1, 0) = u(1, 1) = u(0, 1) = 0$. So the boundary conditions are affine along edges 1–2 and 4–1 and identically zero along edges 2–3 and 3–4. All these boundary conditions are simultaneously captured in the $g(\mathbf{x})$ function. It can be seen from the colormaps that $g(\mathbf{x})$ appropriately interpolates the boundary data in all cases.
+
+> Figure 38: Computation of harmonic coordinate for a vertex in a polygon. Contour plots of $g(\mathbf{x})$ over a (a) square (vertex at the origin), (b) regular hexagon (rightmost vertex), and (c) L-shaped polygon (vertex at the origin).
+
+10.3.1. Harmonic coordinates on a square 
+
+We first consider the case of computing the harmonic coordinate over a square. On a square, there exists an exact solution for this problem: harmonic coordinates coincide with bilinear finite element shape functions. So the solution that is associated with vertex 1 is $u(x, y) = (1 - x)(1 - y)$. To compute the harmonic coordinates, we adopt the Ritz method to determine the approximate solutions. Since harmonic coordinates minimize the Dirichlet energy, use of the Ritz formulation is natural. As done earlier, we consider the approximations $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ and $\tilde{u}_{nn}(\mathbf{x};\theta)$. For $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$, the loss to be minimized is given in (47), whereas for $\tilde{u}_{nn}(\mathbf{x};\theta)$, it is supplemented with an additional term
+
+$$\sum_{\alpha=1}^{n} \left\| \tilde{u}_{nn}(\mathbf{x};\theta) - g_\alpha(\mathbf{x}) \right\|^2_{\partial\Omega_\alpha,\, N_B^\alpha}$$
+
+to impose the essential boundary conditions. Here, $n$ is the number of boundary segments over which the Dirichlet boundary conditions are specified. In Fig. 39, numerical results using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ) and $\tilde{u}_{nn}(\mathbf{x};\theta)$ are presented. In Figs. 39a and 39b, the training loss and normalized absolute error as a function of epochs are presented. The exact solution along with the numerical solutions are displayed in Figs. 39c–39e. Surface plots of the errors in $\tilde{u}^{bc}_{nn}(\mathbf{x})$ and $\tilde{u}_{nn}(\mathbf{x})$ appear in Figs. 39f and 39g. Once again, and consistent with prior findings, $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ exactly satisfies the essential boundary conditions and has far smaller errors than $\tilde{u}_{nn}(\mathbf{x};\theta)$. The large errors in $\tilde{u}_{nn}(\mathbf{x};\theta)$ are particularly noticeable in Fig. 39g. The issue again has to do with the relative scaling of the PDE loss versus the boundary loss. As done earlier, the numerical results from $\tilde{u}_{nn}(\mathbf{x};\theta)$ can be improved by considering a loss of the form:
+
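+$$\mathcal{L}_{nn}(\theta) = w \left[ \frac{1}{N_I} \sum_{k=1}^{N_I} \left[ \tfrac{1}{2}\, a\big(\tilde{u}_{nn}(\mathbf{x}_k;\theta), \tilde{u}_{nn}(\mathbf{x}_k;\theta)\big) - \ell\big(\tilde{u}_{nn}(\mathbf{x}_k;\theta)\big) \right] \right] + (1 - w) \left[ \sum_{\alpha=1}^{n} \left\| \tilde{u}_{nn}(\mathbf{x};\theta) - g_\alpha(\mathbf{x}) \right\|^2_{\partial\Omega_\alpha,\, N_B^\alpha} \right],$$
+
+and then tuning the weight $w \in [0, 1]$. This tunes the relative importance of the boundary loss term with respect to the PDE loss term. In fact, if one considers $w = 10^{-3}$, $\tilde{u}_{nn}(\mathbf{x};\theta)$ achieves errors that are comparable to $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$. In other words, the boundary loss in $\mathcal{L}_{nn}(\theta)$ has to be weighted about a thousand times more than the PDE loss (the ratio is $(1 - w)/w \approx 10^{3}$) in order for $\tilde{u}_{nn}(\mathbf{x};\theta)$ to produce results comparable to $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$.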
+
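The statements about the square can be checked numerically. The following is a minimal pure-Python sketch, not the paper's Ritz training code: it confirms that the bilinear function $u = (1 - x)(1 - y)$ takes the vertex values 1, 0, 0, 0, has zero Laplacian, and that adding a perturbation that vanishes on the boundary raises the Dirichlet energy. The perturbation `sin(pi x) sin(pi y)`, the midpoint quadrature, and all function names are illustrative assumptions.

```python
import math


def u_exact(x, y):
    """Bilinear shape function for the vertex at the origin of the unit square."""
    return (1.0 - x) * (1.0 - y)


def laplacian_fd(u, x, y, h=1e-2):
    """Five-point finite-difference Laplacian (exact for bilinear u up to rounding)."""
    return (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h) - 4.0 * u(x, y)) / h**2


def dirichlet_energy(eps, n=200):
    """Midpoint-rule estimate of 0.5 * integral of |grad(u + eps*v)|^2 on the unit square.

    v = sin(pi x) sin(pi y) is a hypothetical perturbation that is zero on the boundary.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            # Analytic gradient of u + eps*v.
            gx = -(1.0 - y) + eps * math.pi * math.cos(math.pi * x) * math.sin(math.pi * y)
            gy = -(1.0 - x) + eps * math.pi * math.sin(math.pi * x) * math.cos(math.pi * y)
            total += 0.5 * (gx * gx + gy * gy)
    return total * h * h


# Vertex values in the order 1-2-3-4 (counterclockwise from the origin).
corner_values = [u_exact(0, 0), u_exact(1, 0), u_exact(1, 1), u_exact(0, 1)]
lap = laplacian_fd(u_exact, 0.3, 0.7)   # should vanish: u is harmonic
e0 = dirichlet_energy(0.0)              # energy of the harmonic coordinate
e1 = dirichlet_energy(0.1)              # energy after an admissible perturbation
```

Analytically, the energy of the bilinear solution is 1/3, and a perturbation of amplitude $\varepsilon$ adds $\varepsilon^2 \pi^2 / 4$ because the cross term vanishes for a harmonic $u$; this is why `e1` exceeds `e0`.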
+10.3.2. Harmonic coordinates on an L-shaped polygon 
+
+We repeat the computations for the square over an L-shaped polygon. Here we consider $\phi(\mathbf{x})$ that is formed using REQ and MVP. The plots of $\phi(\mathbf{x})$ for REQ and MVP over the L-shaped polygon are shown in Figs. 7d and 10b. The function $g(\mathbf{x})$ over the L-shaped polygon is shown in Fig. 38c. We only present numerical simulation results for REQ and MVP. As shown for the case of the square, the loss terms have to be weighted judiciously to obtain acceptable accuracy for $\tilde{u}_{nn}(\mathbf{x})$. Harmonic coordinates associated with the vertex at the origin and with the vertex at the reentrant corner are computed. Since an exact analytical solution for this problem is not available, we compute an accurate finite element solution that we use as the reference solution. This finite element solution is used to compute the errors in $\tilde{u}^{bc}_{nn}(\mathbf{x})$. A Delaunay triangular mesh is created using the mesh generation package Triangle [100]: the mesh has 13,952 elements, with very small elements in the vicinity of the reentrant corner and larger elements near other vertices. The mesh size, $h = 10^{-3}$, is used near the reentrant corner to capture the weakly singular behavior of the Laplace equation at the reentrant corner.
+
+> Figure 39: Computation of harmonic coordinates on the unit square for the vertex at the origin. (a), (b) Training loss and errors for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ) and $\tilde{u}_{nn}(\mathbf{x};\theta)$. Surface plots of the (c) exact solution, (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$, and (e) $\tilde{u}_{nn}(\mathbf{x})$. Surface plots of the error for the numerical solutions (f) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ and (g) $\tilde{u}_{nn}(\mathbf{x})$.
+
+A representative mesh that displays the interior collocation points in the L-shaped polygon is shown in Fig. 15. In Figs. 40 and 41, numerical solutions obtained from $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) are presented for the computation of $u$ that is associated with vertices at $(0, 0)$ and $(1/2, 1/2)$, respectively. In Fig. 40a, we observe that the training loss stabilizes within a few thousand epochs to about 1.5 for REQ and MVP, and this corresponds to a normalized absolute error (see Fig. 40b) of $O(10^{-2})$. The reference finite element solution is presented in Fig. 40c, and the errors in $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ and MVP) are displayed in Figs. 40d and 40e. Numerical solutions using REQ and MVP have maximum errors of about 3 percent. From Fig. 41a, we observe that the training loss stabilizes within a few thousand epochs to about 8 for REQ and MVP, and this corresponds to a normalized absolute error (see Fig. 41b) of $O(10^{-1})$. The reference finite element solution is depicted in Fig. 41c, which displays sharp gradients near the reentrant corner. The errors in $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ and MVP) are shown in Figs. 41d and 41e, with maximum errors near the singularity on the order of 20 percent. Compared to the errors in Figs. 40d and 40e, this is a 10-fold increase in the maximum error (due to the presence of the derivative singularity at the reentrant corner).
+
+> Figure 40: Computation of harmonic coordinates (vertex at $(0, 0)$) on an L-shaped polygon. (a), (b) Training loss and normalized absolute error for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP). (c) Reference finite element solution. Contour plots of the error for (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ) and (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP).
+
+> Figure 41: Computation of harmonic coordinates (vertex at $(1/2, 1/2)$) on an L-shaped polygon. (a), (b) Training loss and normalized absolute error for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP). (c) Reference finite element solution. Contour plots of the error for (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ) and (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP).
+
+10.4. Clamped circular Kirchhoff plate
+
+We consider the boundary-value problem for a clamped plate that is given in (48). For a clamped circular plate of unit radius and transverse load $f = 1$, the boundary-value problem is:
+
+$$\nabla^4 u = 1 \quad \text{in } \Omega = \{(x, y) : x^2 + y^2 < 1\}, \tag{73a}$$
+
+$$u = 0 \ \text{ on } \partial\Omega, \qquad u_{,n} := \frac{\partial u}{\partial n} = 0 \ \text{ on } \partial\Omega. \tag{73b}$$
+
+The exact solution for this problem in polar coordinates is [101]:
+
+$$u(r) = \frac{(1 - r^2)^2}{64}. \tag{74}$$
+
+Given that both $u$ and $\partial u / \partial n$ are specified on the boundary, this problem illustrates the use of a different solution structure than the ones considered until now. To impose the boundary conditions, we first create a distance function $\phi$ to the circular boundary. In this case, the distance function can be exactly determined as $\phi(\mathbf{x}) = 1 - \sqrt{x^2 + y^2}$; however, this exact distance function has derivative singularities at the origin. Therefore, we use an approximate distance function to a unit circle that is given in (7), which is reproduced below:
+
+$$\phi(\mathbf{x}) = \frac{1 - \mathbf{x} \cdot \mathbf{x}}{2},$$
+
+which is a bivariate polynomial. Now we construct an ansatz that satisfies both essential boundary conditions using (50): $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta) = \phi^2\, \tilde{u}(\mathbf{x};\theta)$. As with previous problems, this problem can be solved using either the collocation approach (see Guo et al. [102]) or the Ritz method. For collocation, we minimize the loss function
+
+$$\mathcal{L}^{bc}_{nn}(\theta) = \left\| \nabla^4 \tilde{u}^{bc}_{nn}(\mathbf{x};\theta) - 1 \right\|^2_{\Omega,\, N_I},$$
+
+whereas for the Ritz approach, we minimize the loss function that is presented in Section 7.2.2. Numerical results for collocation and Ritz using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ are presented in Fig. 42. In the computations, 2800 interior points are used for both methods. For both collocation and Ritz we use the cubic ReLU activation function. Note that for collocation with standard PINN one cannot use the cubic ReLU since it is a biharmonic function. The trial function $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ has other terms that are present in it, and hence in general it is not biharmonic ($\nabla^4 \tilde{u}^{bc}_{nn} \neq 0$). A representative mesh that displays the interior points over a disk is shown in Fig. 15. The network architecture is 2–50–50–1. The training loss as a function of epochs for the collocation and Ritz methods is shown in Fig. 42a. The exact solution is shown in Fig. 42b, and the collocation and Ritz solutions for $\tilde{u}^{bc}_{nn}(\mathbf{x})$ after 10,000 epochs are plotted in Figs. 42c and 42d. The surface plots of the error in the Ritz and collocation solutions are presented in Figs. 42e and 42f. We observe that the unnormalized errors in the Ritz and collocation methods are $O(10^{-5})$; the latter is consistent with the accuracy reported in Guo et al. [102].
+
+> Figure 42: Numerical solutions for clamped circular Kirchhoff plate bending problem using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (Ritz and collocation). Network architecture used is 2–50–50–1. (a) Normalized absolute error in training. Surface plots of the (b) exact solution, and (c), (d) Ritz and collocation solutions. Surface plots of (e) error in Ritz solution and (f) error in collocation solution.
+
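The clamped-plate construction can be sanity-checked without any training. The sketch below is illustrative only and is not the paper's implementation: it uses a 13-point finite-difference stencil (an assumption made here for checking, not part of the method) to confirm that the exact solution (74) satisfies $\nabla^4 u = 1$, and it confirms that an ansatz of the form $\phi^2 \tilde{u}$ has zero value and zero normal derivative on the unit circle for an arbitrary smooth $\tilde{u}$. The stand-in `u_tilde` and all function names are hypothetical.

```python
import math


def phi(x, y):
    """Approximate distance function to the unit circle: (1 - x.x) / 2."""
    return 0.5 * (1.0 - (x * x + y * y))


def u_exact(x, y):
    """Exact clamped-plate deflection (1 - r^2)^2 / 64 for unit load."""
    r2 = x * x + y * y
    return (1.0 - r2) ** 2 / 64.0


def biharmonic_fd(u, x, y, h=0.1):
    """13-point finite-difference stencil for the biharmonic operator.

    It is exact (up to rounding) for polynomials of degree four such as u_exact.
    """
    s = 20.0 * u(x, y)
    s -= 8.0 * (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h))
    s += 2.0 * (u(x + h, y + h) + u(x + h, y - h) + u(x - h, y + h) + u(x - h, y - h))
    s += u(x + 2 * h, y) + u(x - 2 * h, y) + u(x, y + 2 * h) + u(x, y - 2 * h)
    return s / h**4


def ansatz(x, y, u_tilde):
    """Trial function phi^2 * u_tilde, which builds in u = 0 and du/dn = 0 on the circle."""
    return phi(x, y) ** 2 * u_tilde(x, y)


def normal_derivative_on_circle(f, theta, h=1e-5):
    """Central difference of f along the outward normal at the boundary point at angle theta."""
    nx, ny = math.cos(theta), math.sin(theta)
    return (f((1 + h) * nx, (1 + h) * ny) - f((1 - h) * nx, (1 - h) * ny)) / (2 * h)


# Hypothetical smooth function standing in for the network output.
u_tilde = lambda x, y: 1.0 + x + math.sin(3.0 * y)
trial = lambda x, y: ansatz(x, y, u_tilde)

pde_value = biharmonic_fd(u_exact, 0.2, -0.3)             # close to 1
boundary_value = trial(math.cos(0.7), math.sin(0.7))      # close to 0
boundary_slope = normal_derivative_on_circle(trial, 0.7)  # close to 0
```

Since $\phi^2 = (1 - r^2)^2 / 4$, the exact solution (74) is itself of the form $\phi^2 \tilde{u}$ with the constant $\tilde{u} = 1/16$, so it lies inside the ansatz space.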
+10.5. Eikonal equation 
+
+We consider the Eikonal equation, which is a first-order nonlinear hyperbolic PDE. The boundary-value problem of the Eikonal equation is:
+
+$$\|\nabla u\| = \frac{1}{f} \quad \text{in } \Omega \subset \mathbb{R}^2, \tag{75a}$$
+
+$$u = 0 \quad \text{on } \Gamma, \tag{75b}$$
+
+where $\Gamma$ is an interface in two dimensions and $f(\mathbf{x}) > 0$ is the speed on the interface. If $f(\mathbf{x}) = 1$, then $u(\mathbf{x})$ is the shortest (signed) distance from $\mathbf{x}$ to the boundary $\partial\Omega$. If the zero level curve of $u(\mathbf{x})$ represents the initial location of the interface, then $u^{-1}(t)$ yields the location of the interface at time $t$. Hence, $u(\mathbf{x})$ represents the shortest time (arrival time) that is required to travel from the boundary $\Gamma$ to $\mathbf{x}$. For monotonically advancing fronts ($f > 0$), the fast marching method [72] and the fast sweeping method [103] are highly efficient methods to solve (75). When upwind finite differences are used to solve (75), a causality is implied: the value of $u(\mathbf{x})$ only depends on values of $u(\mathbf{y})$ for which $u(\mathbf{x}) > u(\mathbf{y})$. No such restriction is used in PINN during training; we use a collocation method, where the exact satisfaction of the Dirichlet condition on $\Gamma$ is met by constructing ADFs that use R-equivalence from Section 3.2 and the generalized mean value potential from Section 4.1.
+
+We solve (75) with $f = 1$ to compute the signed distance function using PINN. The cubic ReLU activation function, which is a $C^2$ function, is used for all problems. The closed interface $\Gamma$ is embedded within the biunit square, $\Omega_0 = (-1, 1)^2$, and as benchmark problems we consider affine (polygonal) and curved interfaces for $\Gamma$.
+
+The first problem that we consider is the computation of the signed distance function to the boundary of a smaller square, $\Omega = (-1/2, 1/2)^2$. The Dirichlet boundary condition $u = 0$ is imposed on $\Gamma = \partial\Omega$. The network architecture used is 2–30–30–30–1 for $\tilde{u}^{bc}_{nn}$ and 2–70–70–1 for $\tilde{u}_{nn}$. For collocation, 10,000 points are used in the interior of the biunit square, and 400 points on the boundary $\Gamma$ for $\tilde{u}_{nn}(\mathbf{x};\theta)$. Numerical results are presented in Fig. 43. The training loss and normalized absolute errors as a function of epochs for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) and $\tilde{u}_{nn}(\mathbf{x};\theta)$ are shown in Figs. 43a and 43b. Once again, we notice that while $\tilde{u}_{nn}(\mathbf{x};\theta)$ attains the lowest loss among the three schemes, it has the highest error. This problem has an exact solution, which is shown as a contour plot in Fig. 43c. The exact distance, $u(\mathbf{x})$, achieves its maximum value of $1/\sqrt{2}$ at the corners of the biunit square and its minimum value of $-1/2$ at the center. The errors in the numerical solutions are plotted in Figs. 43d–43f. While $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ and MVP) satisfy the boundary condition exactly, MVP results in a slightly more accurate solution in the entire domain. The $L^\infty$ norms of the error using REQ, MVP and $\tilde{u}_{nn}(\mathbf{x})$ are 0.03, 0.026, and 1.34, respectively. The standard PINN, $\tilde{u}_{nn}(\mathbf{x};\theta)$, does poorly on satisfying the boundary condition, which in turn leads to larger pointwise errors over the whole domain.
+
+> Figure 43: Solving the Eikonal equation using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ and $\tilde{u}_{nn}(\mathbf{x};\theta)$ to compute the signed distance to the boundary of a square. The network architecture used is 2–30–30–30–1 for $\tilde{u}^{bc}_{nn}$ and 2–70–70–1 for $\tilde{u}_{nn}$. (a), (b) Training loss and normalized absolute error as a function of epochs for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) and $\tilde{u}_{nn}(\mathbf{x};\theta)$. (c) Exact signed distance function. Contour plots of the error for (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ), (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP), and (f) $\tilde{u}_{nn}(\mathbf{x})$.
+
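The reference values quoted for the square can be reproduced with a closed-form signed distance. The following sketch assumes the sign convention implied by the text (negative inside $\Omega$, positive outside) and uses a standard box signed-distance formula; it is an illustration for checking the numbers, not the paper's code, and the function names are hypothetical.

```python
import math


def signed_distance_square(x, y, half_width=0.5):
    """Signed distance to the boundary of (-half_width, half_width)^2.

    Negative inside the square, positive outside (assumed convention).
    """
    dx = abs(x) - half_width
    dy = abs(y) - half_width
    outside = math.hypot(max(dx, 0.0), max(dy, 0.0))  # nonzero only outside
    inside = min(max(dx, dy), 0.0)                    # nonzero only inside
    return outside + inside


def gradient_norm_fd(u, x, y, h=1e-6):
    """Central-difference estimate of the gradient norm of u at (x, y)."""
    ux = (u(x + h, y) - u(x - h, y)) / (2.0 * h)
    uy = (u(x, y + h) - u(x, y - h)) / (2.0 * h)
    return math.hypot(ux, uy)


corner = signed_distance_square(1.0, 1.0)       # corner of the biunit square
center = signed_distance_square(0.0, 0.0)       # center of the inner square
on_boundary = signed_distance_square(0.5, 0.2)  # a point on Gamma

# Sample points chosen away from the kinks of the distance function,
# where the Eikonal residual ||grad u|| - 1 should vanish.
samples = [(0.9, 0.2), (0.8, -0.9), (0.1, 0.3), (-0.7, 0.6)]
grad_norms = [gradient_norm_fd(signed_distance_square, x, y) for x, y in samples]
```

The distance function is only $C^0$ along the diagonals inside the square, which is one place where a $C^2$ network approximation would be expected to concentrate its error.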
+As the next problem, we consider the signed distance function to the boundary $\Gamma$ of an ellipse that is centered at the origin and with semi-major and semi-minor axes of 0.25 and 0.15, respectively. The approximate distance function to the ellipse, $\phi(\mathbf{x})$, is computed using (8) for REQ (see Fig. 4b) and using (16) for MVP (see Fig. 13a). The network architecture used is 2–50–50–1. The results that $\tilde{u}_{nn}(\mathbf{x})$ produced are very poor, and hence are not included. Numerical results using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) are presented in Fig. 44. The training loss and the normalized error as a function of epochs are shown in Figs. 44a and 44b. The exact distance function (computed numerically) is presented in Fig. 44c, and the error plots for the numerical solutions $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ and MVP) are shown in Figs. 44d and 44e, respectively. Larger errors are concentrated in the region that is close to the center of the ellipse; away from the center the errors are less than 1 percent. The exact distance function is $C^0$ (derivative discontinuities at the center of the ellipse), whereas the numerical solution is $C^2$ smooth.
+
+> Figure 44: Solving the Eikonal equation using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ to compute the signed distance to the boundary of an ellipse. The network architecture used is 2–50–50–1. (a), (b) Training loss and normalized absolute error in training as a function of epochs for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP). (c) Exact signed distance function. Contour plots of the error for (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ) and (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP).
+
+Lastly, we consider the signed distance function to the boundary $\Gamma$ of the polygonalized map of Bhutan; see plots of the approximate distance functions to $\Gamma$ using REQ and MVP that are presented in Figs. 8 and 11, respectively. The network architecture used is 2–50–50–1. Here too the results of $\tilde{u}_{nn}(\mathbf{x})$ are not included since they are very poor. Numerical results using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP) are presented in Fig. 45. The training loss and the normalized absolute error as a function of epochs are shown in Figs. 45a and 45b. At 20,000 epochs, the normalized absolute error is $O(10^{-1})$. The exact distance function is presented in Fig. 45c, and the error plots for the numerical solutions $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ and MVP) are presented in Figs. 45d and 45e, respectively. The maximum error is about 4 percent.
+
+11. Poisson Problem over the Four-Dimensional Hypercube 
+
+As the last problem, we consider the model isotropic steady-state heat conduction (Poisson) problem over the 4-dimensional hypercube to show that essential boundary conditions can be readily imposed in higher dimensions as well. Consider the following Poisson problem with homogeneous Dirichlet boundary conditions:
+
+$$-\nabla^2 u = f \quad \text{in } \Omega = (-1, 1)^4, \tag{76a}$$
+
+$$u = 0 \quad \text{on } \partial\Omega, \tag{76b}$$
+
+where $u(\mathbf{x}) : \mathbb{R}^4 \to \mathbb{R}$ is sought and $f(\mathbf{x})$ is the forcing function. Let $\mathbf{x} := (x_1, x_2, x_3, x_4) \in \Omega$ denote a point in the hypercube. We choose $f(\mathbf{x}) = \prod_{i=1}^{4} \sin(\pi x_i)$, so that the exact solution is:
+
+$$u(\mathbf{x}) = \frac{\prod_{i=1}^{4} \sin(\pi x_i)}{4\pi^2}.$$
+
+As noted in prior high-dimensional studies using PINN [8, 9], numerical solutions for problems in high dimensions are challenging since there is no easy way to mesh the domain and they are also subject to the curse of dimensionality.
+
+> Figure 45: Solving the Eikonal equation using $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ to compute the signed distance to the boundary of the polygonal map of Bhutan. The network architecture is 2–50–50–1. (a), (b) Training loss and normalized absolute error in training as a function of epochs for $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ (REQ and MVP). (c) Exact signed distance function. Contour plots of the error for (d) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (REQ) and (e) $\tilde{u}^{bc}_{nn}(\mathbf{x})$ (MVP).
+
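The exact solution stated for the four-dimensional Poisson problem (76) can be verified directly: each second derivative of $\prod_i \sin(\pi x_i)$ contributes a factor $-\pi^2$, so the negative Laplacian is $4\pi^2$ times the product. The sketch below checks this with central differences at one interior point; the sample point, step size, and function names are illustrative assumptions.

```python
import math


def forcing(x):
    """Forcing f(x) = prod_i sin(pi * x_i)."""
    return math.prod(math.sin(math.pi * xi) for xi in x)


def u_exact(x):
    """Stated exact solution f(x) / (4 * pi^2)."""
    return forcing(x) / (4.0 * math.pi**2)


def neg_laplacian_fd(u, x, h=1e-3):
    """Central-difference approximation of -Laplacian(u) at the point x (any dimension)."""
    total = 0.0
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        total += (u(xp) - 2.0 * u(x) + u(xm)) / h**2
    return -total


point = (0.3, -0.4, 0.7, 0.1)  # arbitrary interior point of (-1, 1)^4
residual = neg_laplacian_fd(u_exact, point) - forcing(point)  # PDE residual, close to 0
boundary_value = u_exact((1.0, 0.3, -0.2, 0.5))               # on the face x_1 = 1, close to 0
```

The same residual function could serve as a pointwise check on a trained network output, provided that output is wrapped as a Python callable.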
+Among meshfree methods, since construction of radial basis functions is dimension-independent, RBF-based meshfree methods have had success in solving high-dimensional problems [104]. It is in these problems that the power and potential of a meshfree method such as PINN becomes most apparent. For this problem, we only consider trial functions, $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$, which exactly enforce the homogeneous Dirichlet boundary condition on $\partial\Omega$. One way of enforcing the boundary conditions is to assume $\tilde{u}^{bc}_{nn}$ to be of the form
+
+$$\tilde{u}^{bc}_{nn}(\mathbf{x};\theta) = \left[ \prod_{i=1}^{4} (1 - x_i^2) \right] \tilde{u}^{R}_{nn}(\mathbf{x};\theta),$$
+
+which we refer to as the 'product method'. While this is an obvious approach for the present problem, it leads to very small numbers inside the domain and away from the boundaries. In this case, the multiplicative factor scales as $O(x^8)$ inside the biunit hypercube, and therefore the network parameters have to compensate for this highly nonlinear behavior during training. It is preferable to have a multiplicative factor that is better behaved in order to aid the training. To this end, we construct $\phi(\mathbf{x})$ using R-equivalence in (10), which seamlessly extends to higher dimensions. For this choice, the trial function $\tilde{u}^{bc}_{nn}(\mathbf{x};\theta)$ is of the form:
+
+$$\tilde{u}^{bc}_{nn}(\mathbf{x};\theta) = \phi(\mathbf{x})\, \tilde{u}^{R}_{nn}(\mathbf{x};\theta), \tag{77}$$
+
+where $\phi(\mathbf{x})$ consists of R-equivalence (REQ) operations on $\phi_i(\mathbf{x})$, where $\phi_i(\mathbf{x})$ is the R-function for the region (strip) bounded by the hyperplanes $1 - x_i$ and $x_i - 1$. We form $\phi_i(\mathbf{x}) = (1 - x_i^2)/2$, which is an ADF that is normalized to order 1. On using the REQ composition in (10), we write
+
+$$\phi(\mathbf{x}) = \phi_1(\mathbf{x}) \sim \phi_2(\mathbf{x}) \sim \phi_3(\mathbf{x}) \sim \phi_4(\mathbf{x}), \tag{78}$$
+
+which generalizes to the hypercube in $\mathbb{R}^d$. Note that $\phi(\mathbf{x})$ only scales as $x_i^2$ in each coordinate direction, and therefore is much better behaved. For this problem, another choice for $\phi(\mathbf{x})$ is to define two $\phi_i$'s in each dimension, i.e., $\phi_{2i-1} = 1 + x_i$ and $\phi_{2i} = 1 - x_i$, and then define $\phi(\mathbf{x}) = \phi_1 \sim \phi_2 \sim \cdots \sim \phi_8$ for the 4-dimensional hypercube. For $m = 1$, this construction coincides with the expression for $\phi$ in (78).
+
+In Fig. 46, we present the numerical solutions using the product and REQ ($m = 1$) forms of the trial function. For both choices, we consider 5,000 randomly generated interior points in $\Omega$ for training and compute the normalized error at a separate set of 5,000 interior points. The network architecture for both choices is 4–100–100–1. The isosurface plot of $\phi(\mathbf{x})$ ($x_4 = 0$ plane) using REQ is shown in Fig. 46a. In Fig. 46b, the evolution of the training loss and normalized absolute error is presented for the product and REQ trial functions. We observe that while the REQ method is able to reach error levels of about 1 percent, the product method does not converge. The $\tilde{u}^{bc}_{nn}(\mathbf{x})$ solution with REQ yields $O(10^{-2})$ error, whereas the product method has $O(1)$ error even though it has a much smaller PDE loss of $O(10^{-5})$.
+
+> Figure 46: Numerical solution of the Poisson problem over the four-dimensional hypercube. The network architecture is 4–100–100–1. The trial function, $\tilde{u}^{bc}_{nn}(\mathbf{x})$, is constructed using R-equivalence (REQ) and the product method. (a) Isosurface plot of the approximate distance function using REQ, $\phi(\mathbf{x})$, over the 3-dimensional biunit cube ($x_4 = 0$), with $\phi = 0$ being satisfied on the boundary of the cube. (b) Training loss and normalized absolute error for REQ and the product method as a function of epochs.
+
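The difference between the product factor and the R-equivalence factor can be made concrete with a few evaluations. Equation (10) is not reproduced in this section, so the sketch below assumes the commonly used normalized form of the REQ joining operation, $\phi = \big(\sum_i \phi_i^{-m}\big)^{-1/m}$; that formula, and every function name, should be read as an assumption rather than as the paper's definition. Under that assumption, the code checks the statement that for $m = 1$ the eight-face construction coincides with the four-strip construction in (78).

```python
def product_factor(x):
    """Multiplicative factor of the 'product method': prod_i (1 - x_i^2)."""
    out = 1.0
    for xi in x:
        out *= 1.0 - xi * xi
    return out


def req(phis, m=1):
    """R-equivalence joining of several ADFs (assumed form: (sum phi_i^-m)^(-1/m))."""
    if any(p == 0.0 for p in phis):
        return 0.0  # the joined function vanishes wherever any component vanishes
    return sum(p ** (-m) for p in phis) ** (-1.0 / m)


def phi_strips(x, m=1):
    """REQ of the four strip functions phi_i = (1 - x_i^2) / 2, as in (78)."""
    return req([(1.0 - xi * xi) / 2.0 for xi in x], m)


def phi_faces(x, m=1):
    """REQ of the eight face functions 1 + x_i and 1 - x_i."""
    faces = []
    for xi in x:
        faces.extend([1.0 + xi, 1.0 - xi])
    return req(faces, m)


near_corner = (0.9, 0.9, 0.9, 0.9)
generic = (0.3, -0.5, 0.0, 0.8)

prod_near_corner = product_factor(near_corner)  # 0.19^4, very small
req_near_corner = phi_strips(near_corner)       # 0.095 / 4, same order as the distance
gap_generic = abs(phi_strips(generic) - phi_faces(generic))
on_face = phi_strips((1.0, 0.0, 0.0, 0.0))      # exactly zero on the boundary
```

At a point 0.1 away from four faces, the product factor is roughly 0.0013 while the REQ factor is roughly 0.024, which illustrates the scaling argument made above for why the product form is harder to train.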
+12. Conclusions 
+
+Starting from the seminal works of Lagaris et al. [1–3] and the recent extensions and major advancements by Raissi et al. [6] and E and Yu [9], there has been a surge in the development and application of physics-informed neural networks to solve partial differential equations. In this paper, we have introduced a new approach based on distance fields to construct geometry-aware approximations in physics-informed neural networks by ensuring that the necessary boundary conditions are met a priori: all boundary conditions in a collocation method, and the essential boundary conditions (kinematic admissibility) in a Ritz method. Our approach relied on the theory of R-functions [23, 24] to construct approximate distance fields and their use within a meshfree method to exactly impose boundary conditions to solve PDEs [27, 30]. Apart from R-functions, we also showed that mean value potential fields [34–36] can be used to construct suitable distance fields to solve PDEs over domains with affine as well as curved boundaries.
+
+We presented several numerical examples to reveal the benefits of exactly imposing the boundary conditions versus the current state-of-the-art in deep collocation and deep Ritz methods for physics-informed neural networks. Notably, requiring only the interior residual error contribution in the loss function simplifies the training of the network and leads to more accurate numerical solutions. This was shown through several verification tests on benchmark one- and two-dimensional boundary-value problems, which consistently revealed the pitfalls of being guided by the magnitude of the loss function in standard PINN-based collocation approaches [6]. In PINN methods, when there are multiple terms (PDE loss and boundary losses) that are present in the loss function and the loss weights are fixed a priori, the magnitude of the loss function at the end of the training does not provide a measure of the accuracy of the approximation. There can be a many-fold difference in the two error measures, which is revealed in our simulations. This is not surprising since the weights associated with each loss term are not known a priori: they depend on the PDE, the boundary conditions, and the training. There is no clear rational way to set these weights in order to ensure that the approach is robust and guaranteed to lead to reliable results. This study has reinforced that it is important that the PDE loss stands on its own, and boundary conditions are enforced via the ansatz. One approach is to construct a separate neural network to meet the boundary conditions as is done by Berg and Nyström [7], but the inherent inaccuracy in the satisfaction of the boundary conditions then propagates when training for the PDE is conducted. Moreover, when the geometry is complicated and different types of boundary conditions are imposed on different subsets of the boundary, then this approach may soon become impractical. Our approach ensures exact satisfaction of all necessary boundary conditions, which makes it appealing: training of the network (training loss and PDE loss coincide) is more efficient and accurate solutions can be realized.
+
+This study has provided a method to perform meshfree analysis (solving PDEs without domain discretization) on complex two-dimensional geometries using physics-informed neural networks. This was made possible on using approximate distance functions, which were based on R-functions (R-equivalence composition) and generalized mean value potentials, in conjunction with transfinite inverse-distance based interpolation to exactly satisfy the boundary conditions. For the problems that we considered, both ADFs delivered the same order of accuracy with the constant being smaller for R-equivalence in most cases (Eikonal equation was the exception). If $u = 0$ is imposed on the boundary of a polygon with many (tens to hundreds) edges, then the ADF using mean value potential is more efficient since it results in an ADF by construction, whereas R-equivalence requires first constructing an ADF to each edge and then the joining operation to be performed. However, if different boundary conditions are imposed on distinct boundary segments then only R-equivalence is viable since this cannot be realized using the ADF based on mean value potential fields.
+
+While we have used R-functions and the mean value potential to construct smooth approximate distance functions, fields such as constructive geometric modeling [105] with implicit functions [73, 106–108], PDE-based solutions for distance computations [109–111], and deep learning in computer vision [112] are rapidly advancing and may offer other attractive alternatives to construct smooth distance fields for use in deep neural networks to solve PDEs. A separate and more in-depth investigation is needed to explore if there are other network architectures and optimizers for network training that are better suited for the PINN ansatz with approximate distance functions, and to also quantify the accuracy that is obtained using different ADFs. Lastly, the formulation permitted exact modeling of affine and curved boundaries, thereby providing a pathway to conducting simulations on the exact geometry (isogeometric analysis) [37].
+
+The ideas herein can be extended for higher dimensional problems, since R-equivalence composition for the approximate distance function is additive and does not suffer from the curse of dimensionality. This was demonstrated in Section 11, where we obtained an accurate PINN solution for a Poisson problem over the 4-dimensional hypercube. Extending our formulation to complex geometries in 3D, and the development of a deep Petrov-Galerkin domain-decomposition method are topics that we plan to pursue.
+
+13. Acknowledgments 
+
+AS acknowledges support from the NSF CAREER grant #1554033 to the Illinois Institute of Technology. NS thanks Anand Reddy, Eric Chin and Kai Hormann for many helpful discussions. 
+
+References             
+
+> [1] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9 (5) (1998) 987–1000.
+> [2] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural network methods in quantum mechanics, Computer Physics Communications 104 (1997) 1–14.
+> [3] I. E. Lagaris, A. C. Likas, D. G. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Transactions on Neural Networks 11 (5) (2000) 1041–1049.
+> [4] K. S. McFall, An artificial neural network method for solving boundary value problems with arbitrary irregular boundaries, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, USA (2006).
+> [5] K. S. McFall, J. R. Mahan, Artificial neural network method for solution of boundary value problems with exact satisfaction of arbitrary boundary conditions, IEEE Transactions on Neural Networks 20 (8) (2009) 1221–1233.
+> [6] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707.
+> [7] J. Berg, K. Nyström, A unified deep artificial neural network approach to partial differential equations in complex geometries, Neurocomputing 317 (2018) 28–41.
+> [8] J. Sirignano, K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics 375 (2018) 1339–1364.
+> [9] W. E, B. Yu, The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics 6 (1) (2018) 1–12.
+> [10] J. Han, A. Jentzen, W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510.
+> [11] E. Kharazmi, Z. Zhang, G. E. Karniadakis, Variational physics-informed neural networks for solving partial differential equations (2019). arXiv:1912.00873.
+> [12] E. Kharazmi, Z. Zhang, G. E. Karniadakis, hp-VPINNs: Variational physics-informed neural networks with domain decomposition, Computer Methods in Applied Mechanics and Engineering 374 (2020) 113547.
+> [13] L. Lu, X. Meng, Z. Mao, G. E. Karniadakis, DeepXDE: A deep learning library for solving differential equations, SIAM Review 63 (1) (2021) 208–228.
+> [14] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
+> [15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
+> [16] S. Wang, Y. Teng, P. Perdikaris, Understanding and mitigating gradient pathologies in physics-informed neural networks (2020). arXiv:2001.04536.
+> [17] J. Chen, R. Du, K. Wu, A comparison study of deep Galerkin method and deep Ritz method for elliptic problems with different boundary conditions, Communications in Mathematical Research 36 (3) (2020) 354–376.
+> [18] L. Lyu, K. Wu, R. Du, J. Chen, Enforcing exact boundary and initial conditions in the deep mixed residual method (2020). arXiv:2008.01491.
+> [19] I. Babuška, U. Banerjee, J. E. Osborn, Survey of meshless and generalized finite element methods: a unified approach, Acta Numerica 12 (2003) 1–125.
+> [20] A. Huerta, T. Belytschko, S. Fernández-Méndez, T. Rabczuk, X. Zhuang, M. Arroyo, Meshfree methods, 2nd Edition, Vol. 2 of Encyclopedia of Computational Mechanics, Wiley, 2017, Ch. 3, pp. 1–38.
+> [21] L. V. Kantorovich, V. I. Krylov, Approximate Methods of Higher Analysis, Interscience, New York, NY, USA, 1958.
+> [22] V. L. 
Rvachev, Theory of R-functions and Some Applications, Naukova Dumka, Kiev. In Russian, 1982. [23] V. L. Rvachev, T. I. Sheiko, R-functions in boundary value problems in mechanics, Applied Mechanics Reviews 48 (4) (1995) 151–188. [24] V. L. Rvachev, T. I. Sheiko, V. Shapiro, I. Tsukanov, On completeness of RFM solution structures, Computational Mechanics 25 (2000) 305–3163. [25] V. L. Rvachev, T. I. Sheiko, V. Shapiro, I. Tsukanov, Transfinite interpolation over implicitly defined sets, Computer Aided Geometric Design 18 (2001) 195–220. [26] V. Shapiro, Theory of R-functions and applications: A primer, Tech. Rep. CPA88-3, Cornell Programmable Automation, Sibley School of Mechanical Engineering, Ithaca, NY 14853, USA (1991). [27] V. Shapiro, I. Tsukanov, Meshfree simulation of deforming domains, Computer-Aided Design 31 (7) (1999) 459–471. [28] V. Shapiro, I. Tsukanov, The architecture of SAGE–a meshfree system based on RFM, Engineering with Computers 18 (4) (2002) 295–311. [29] A. Biswas, V. Shapiro, Approximate distance fields with non-vanishing gradients, Graphical Models 66 (3) (2004) 133–159. [30] V. Shapiro, Semi-analytic geometry with R-functions, Acta Numerica 16 (2007) 239–303. [31] M. Freytag, V. Shapiro, I. Tsukanov, Finite element analysis in situ, Finite Elements in Analysis and Design 47 (9) (2011) 957–972. [32] K. H¨ ollig, U. Reif, J. Wipper, Weighted extended B-spline approximation of Dirichlet problems, SIAM Journal on Numerical Analysis 39 (2) (2001) 442–462. [33] D. Mill´ an, N. Sukumar, M. Arroyo, Cell-based maximum-entropy approximants, Computer Methods in Applied Mechanics and Engineering 284 (2015) 712–731. [34] M. S. Floater, Mean value coordinates, Computer Aided Geometric Design 20 (1) (2003) 19–27. [35] C. Dyken, M. S. Floater, Transfinite mean value interpolation, Computer Aided Geometric Design 26 (1) (2009) 117–134. [36] A. Belyaev, P.-A. Fayolle, A. Pasko, Signed Lp-distance fields, Computer-Aided Design 45 (2) (2013) 523–528. [37] T. 
J. R. Hughes, J. A. Cottrell, Y. Bazilevs, Isogeometric analysis: CAD, finite elements, NURBS, exact geometry and mesh refinement, Computer Methods in Applied Mechanics and Engineering 194 (39–41) (2005) 4135–4195. [38] P. Thoutireddy, M. Ortiz, A variational r-adaption and shape-optimization method for finite-deformation elasticity, International Journal for Numerical Methods in Engineering 61 (1) (2004) 1–21. [39] J. He, L. Li, J. Xu, C. Zheng, ReLU deep neural networks and linear finite elements (2018). arXiv:1807.03973 .[40] E. Grinspun, The basis refinement method, Ph.D. thesis, California Institute of Technology, Pasadena, CA, USA (2003). [41] E. C. Cyr, M. A. Gulian, R. G. Patel, M. Perego, N. A. Trask, Robust training and initialization of deep neural networks: An adaptive basis viewpoint, in: Mathematical and Scientific Machine Learning, PMLR, 2020, pp. 512–536. [42] J. A. A. Opschoor, P. C. Petersen, C. Schwab, Deep ReLU networks and high-order finite element methods, Analysis and Applications 18 (05) (2020) 715–770. [43] E. J. Kansa, Multiquadrics—A scattered data approximation scheme for applications to computational fluid-dynamics. 1. Surface approxi-mations and partial derivative estimates, Computers & Mathematics with Applications 19 (8 /9) (1990) 127–145. [44] E. J. Kansa, Multiquadrics—A scattered data approximation scheme for applications to computational fluid-dynamics. 2. Solutions to parabolic, hyperboloc and elliptic partial-di ff erential equations, Computers & Mathematics with Applications 19 (8 /9) (1990) 147–161. [45] M. D. Buhmann, Radial basis functions: theory and implementations, Cambridge University Press, Cambridge, UK, 2003. [46] G. Fasshauer, Meshfree Approximation Methods in MATLAB, Interdisciplinary Mathematical Sciences – Vol. 6, World Scientific Publishers, Singapore, 2007. [47] R. Schaback, H. Wendland, Kernel techniques: from machine learning to meshless methods, Acta Numerica 15 (2006) 543. [48] M. Arroyo, M. 
Ortiz, Local maximum-entropy approximation schemes: a seamless bridge between finite elements and meshfree methods, International Journal for Numerical Methods in Engineering 65 (13) (2006) 2167–2202. [49] I. Babuˇ ska, J. M. Melenk, The partition of unity method, International Journal for Numerical Methods in Engineering 40 (1997) 727–758. [50] V. Rajan, Optimality of the Delaunay triangulation in Rd , Discrete & Computational Geometry 12 (1) (1994) 189–202. [51] N. Sukumar, Construction of polygonal interpolants: a maximum entropy approach, International Journal for Numerical Methods in Engi-neering 61 (12) (2004) 2159–2181. [52] N. Sukumar, Maximum entropy approximation, AIP Conference Proceedings 803 (1) (2005) 337–344. [53] M. Arroyo, M. Ortiz, Local maximum-entropy approximation schemes, in: M. Griebel, M. A. Schweitzer (Eds.), Meshfree Methods for Partial Di ff erential Equations III, Vol. 57 of Lecture Notes in Computational Science and Engineering, Springer, Berlin, Germany, 2007, pp. 1–16. 
+
+48 [54] N. Sukumar, R. W. Wright, Overview and construction of meshfree basis functions: From moving least squares to entropy approximants, International Journal for Numerical Methods in Engineering 70 (2) (2007) 181–205. [55] A. Rosolen, D. Mill´ an, M. Arroyo, On the optimum support size in meshfree methods: a variational adaptivity approach with maximum entropy approximants, International Journal for Numerical Methods in Engineering 82 (7) (2010) 868–895. [56] J. Park, I. W. Sandberg, Universal approximation using radial-basis-function networks, Neural computation 3 (1991) 246–257. [57] H. N. Mhaskar, Neural networks for optimal approximation of smooth and analytic functions, Neural computation 8 (1) (1996) 164–177. [58] K. Lee, N. A. Trask, R. G. Patel, M. A. Gulian, E. C. Cyr, Partition of unity networks: deep hp-approximation (2021). arXiv:2101. 11256 .[59] A. A. Ramabathiran, P. Ramachandran, SPINN: Sparse, physics-based, and partially interpretable neural networks for PDEs, Journal of Computational Physics 445 (2021) 110600. [60] F. Greco, M. Arroyo, High-order maximum-entropy collocation methods, Computer Methods in Applied Mechanics and Engineering 367 (2020) 113115. [61] H. Sheng, C. Yang, PFNN: A penalty-free neural network method for solving a class of second-order boundary-value problems on complex geometries, Journal of Computational Physics 428 (2021) 110085. [62] V. Dwivedi, B. Srinivasan, Physics informed extreme learning machine (PIELM)–A rapid method for the numerical solution of partial di ff erential equations, Neuralcomputing 391 (2020) 96–118. [63] V. Dwivedi, B. Srinivasan, Solution of biharmonic equation in complicated geometries with physics informed extreme learning machine, Journal of Computing and Information Science in Engineering 20 (6) (2020). [64] Y. Liao, P. Ming, Deep Nitsche method: Deep Ritz method with essential boundary conditions (2019). arXiv:1912.01309 .[65] S. Li, W. K. 
Liu, Meshfree Particle Methods, Springer-Verlag, New York, NY, USA, 2004. [66] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366. [67] K. Hornik, Approximation capabilities of multilayer perceptrons, Neural Networks 4 (1991) 251–257. [68] G. Strang, G. J. Fix, An Analysis of the Finite Element Method, Prentice–Hall, New York, NY, USA, 1973. [69] F. M. Rohrhofer, S. Posch, B. C. Geiger, On the pareto front of physics-informed neural networks (2021). arXiv:2105.00862 .[70] O. Hennigh, S. Narasimhan, M. A. Nabian, A. Subramaniam, K. Tangsali, Z. Fang, M. Rietmann, W. Byeon, S. Choudhry, NVIDIA SimNet ™: An AI-accelerated multi-physics simulation framework, in: International Conference on Computational Science, Springer, 2021, pp. 447–461. [71] I. Tsukanov, S. R. Posireddy, Hybrid method of engineering analysis: Combining meshfree method with distance fields and collocation technique, Journal of Computing and Information Science in Engineering 11 (3) (2011). [72] J. A. Sethian, Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, Cambridge University Press, Cambridge, U.K., 1999. [73] J. Bloomenthal, Bulge elimination in convolution surfaces, Computer Graphics Forum 16 (1) (1997) 31–41. [74] V. Shapiro, I. Tsukanov, Implicit functions with guaranteed di ff erential properties, in: Proceedings of the Fifth ACM Symposium on Solid Modeling and Applications, 1999, pp. 258–269. [75] K. Upreti, T. Song, A. Tambat, G. Subbarayan, Algebraic distance estimations for enriched isogeometric analysis, Computer Methods in Applied Mechanics and Engineering 280 (2014) 28–56. [76] E. B. Chin, N. Sukumar, Modeling curved interfaces without element-partitioning in the extended finite element method, International Journal for Numerical Methods in Engineering 120 (5) (2019) 607–649. [77] A. G. Belyaev, P.-A. 
Fayolle, Transfinite barycentric coordinates, in: Hormann and Sukumar [80], pp. 43–62. [78] M. S. Floater, Generalized barycentric coordinates and applications, Acta Numerica 24 (2015) 161–214. [79] D. Anisimov, Barycentric coordinates and their properties, in: Hormann and Sukumar [80], pp. 3–22. [80] K. Hormann, N. Sukumar (Eds.), Generalized Barycentric Coordinates in Computer Graphics and Computational Mechanics, CRC Press, New York, NY, 2017. [81] K. Hormann, M. S. Floater, Mean value coordinates for arbitrary planar polygons, ACM Transactions on Graphics 25 (4) (2006) 1424–1441. [82] S. Bruvoll, M. S. Floater, Transfinite mean value interpolation in general dimension, Journal of Computational and Applied Mathematics 233 (7) (2010) 1631–1639. [83] T. Ju, S. Schaefer, J. Warren, Mean value coordinates for closed triangular meshes, ACM Transactions on Graphics 24 (3) (2005) 561–566. [84] E. B. Chin, N. Sukumar, Scaled boundary cubature scheme for numerical integration over planar regions with a ffi ne and curved boundaries, Computer Methods in Applied Mechanics and Engineering 380 (2021) 113796. [85] D. Shepard, A two-dimensional interpolation function for irregularly-spaced data, in: Proceedings of the 23rd ACM national conference, Association for Computing Machinery, New York, New York, 1968, pp. 517–524. [86] F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain., Psychological Review 65 (6) (1958) 386. [87] Y. LeCun, The MNIST database of handwritten digits, http: // yann.lecun.com /exdb /mnist / (1998). [88] D. Finol, Y. Lu, V. Mahadevan, A. Srivastava, Deep convolutional neural networks for eigenvalue problems in mechanics, International Journal for Numerical Methods in Engineering 118 (5) (2019) 258–275. [89] K. Hornik, Some new results on neural network approximation, Neural Networks 6 (1993) 1069–1072. [90] A. 
LeNail, Nn-svg: Publication-ready neural network architecture schematics, Journal of Open Source Software 4 (33) (2019) 747. [91] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization (2014). arXiv:1412.6980 .[92] E. Samaniego, C. Anitescu, S. Goswami, V. M. Nguyen-Thanh, H. Guo, K. Hamdia, X. Zhuang, T. Rabczuk, An energy approach to the solution of partial di ff erential equations in computational mechanics via machine learning: Concepts, implementation and applications, Computer Methods in Applied Mechanics and Engineering 362 (2020) 112790. [93] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, Q. Zhang, JAX: composable transformations of Python +NumPy programs URL http://github.com/google/jax (2018). [94] E. Bisong, Google colaboratory, in: Building Machine Learning and Deep Learning Models on Google Cloud Platform, Springer, 2019, pp. 59–64. [95] N. Schl¨ omer, J. Hariharan, dmsh, (2020). Available at https://github.com/nschloe/dmsh . Accessed on April 1, 2021. [96] P.-O. Persson, G. Strang, A simple mesh generator in MATLAB, SIAM Review 46 (2) (2004) 329–345. 
+
+49 [97] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, A. Courville, On the spectral bias of neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 5301–5310. [98] S. Wang, H. Wang, P. Perdikaris, On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks, Computer Methods in Applied Mechanics and Engineering 384 (2021) 113938. [99] J. Pushkar, M. Meyer, T. DeRose, B. Green, T. Sanocki, Harmonic coordinates for character articulation, ACM Transactions on Graphics 26 (3) (2007) Article 71. [100] J. R. Shewchuk, Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator, in: M. C. Lin, D. Manocha (Eds.), Applied Computational Geometry: Towards Geometric Engineering, Vol. 1148 of Lecture Notes in Computer Science, Springer-Verlag, 1996, pp. 203–222. [101] S. P. Timoshenko, S. Woinowsky-Krieger, Theory of Plates and Shells, 2nd Edition, McGraw Hill, New York, NY, USA, 1959. [102] H. Guo, X. Zhuang, T. Rabczuk, A deep collocation method for the bending analysis of Kirchho ff plate (2021). arXiv:2102.02617 .[103] H. Zhao, A fast sweeping method for Eikonal equations, Mathematics of Computation 74 (250) (2005) 603–627. [104] T. Cecil, J. Qian, S. Osher, Numerical methods for high dimensional Hamilton–Jacobi equations using radial basis functions, Journal of Computational Physics 196 (1) (2004) 327–347. [105] A. Ricci, A constructive geometry for computer graphics, The Computer Journal 16 (2) (1973) 157–160. [106] A. Sherstyuk, Kernel functions in convolution surfaces: a comparative analysis, The Visual Computer 15 (4) (1999) 171–182. [107] L. Barthe, B. Wyvill, E. De Groot, Controllable binary CSG operators for “soft objects”, International Journal of Shape Modelling 10 (02) (2004) 135–154. [108] O. Gourmel, L. Barthe, M.-P. Cani, B. Wyvill, A. Bernhardt, M. Paulin, H. 
Grasberger, A gradient-based implicit blend, ACM Transactions on Graphics 32 (2) (2013) 12:1–12:12. [109] A. Belyaev, P.-A. Fayolle, On variational and PDE-based distance function approximations, Computer Graphics Forum 34 (8) (2015) 104– 118. [110] K. Crane, C. Weischedel, M. Wardetzky, The heat method for distance computation, Communications of the ACM 60 (11) (2017) 90–99. [111] A. G. Belyaev, P.-A. Fayolle, A variational method for accurate distance function estimation, in: Numerical Geometry, Grid Generation and Scientific Computing, Springer, Cham, 2019, pp. 175–181. [112] J. J. Park, P. Florence, J. Straub, R. Newcombe, S. Lovegrove, Deepsdf: Learning continuous signed distance functions for shape represen-tation, in: Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 165–174. 
+
+50
diff --git a/pinn/docs/evidence/wang2022_causal_training.md b/pinn/docs/evidence/wang2022_causal_training.md
new file mode 100644
index 0000000..e50ed34
--- /dev/null
+++ b/pinn/docs/evidence/wang2022_causal_training.md
@@ -0,0 +1,1897 @@
+Title: 2203.07404v1.pdf
+
+URL Source: https://arxiv.org/pdf/2203.07404
+
+Published Time: Mon, 23 Jan 2023 13:55:29 GMT
+
+Number of Pages: 35
+
+Markdown Content:
+# RESPECTING CAUSALITY IS ALL YOU NEED FOR TRAINING PHYSICS-INFORMED NEURAL NETWORKS
+
+Sifan Wang 
+
+Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104 
+
+sifanw@sas.upenn.edu 
+
+Shyam Sankaran 
+
+Department of Mechanical Engineering and Applied Mechanics University of Pennsylvania Philadelphia, PA 19104 
+
+shyamss@seas.upenn.edu 
+
+Paris Perdikaris 
+
+Department of Mechanical Engineering and Applied Mechanics University of Pennsylvania Philadelphia, PA 19104 
+
+pgp@seas.upenn.edu 
+
+March 16, 2022 
+
+## ABSTRACT 
+
+While the popularity of physics-informed neural networks (PINNs) is steadily rising, to this date PINNs have not been successful in simulating dynamical systems whose solution exhibits multi-scale, chaotic or turbulent behavior. In this work we attribute this shortcoming to the inability of existing PINNs formulations to respect the spatio-temporal causal structure that is inherent to the evolution of physical systems. We argue that this is a fundamental limitation and a key source of error that can ultimately steer PINN models to converge towards erroneous solutions. We address this pathology by proposing a simple re-formulation of PINNs loss functions that can explicitly account for physical causality during model training. We demonstrate that this simple modification alone is enough to introduce significant accuracy improvements, as well as a practical quantitative mechanism for assessing the convergence of a PINNs model. We provide state-of-the-art numerical results across a series of benchmarks for which existing PINNs formulations fail, including the chaotic Lorenz system, the Kuramoto–Sivashinsky equation in the chaotic regime, and the Navier-Stokes equations in the turbulent regime. To the best of our knowledge, this is the first time that PINNs have been successful in simulating such systems, introducing new opportunities for their applicability to problems of industrial complexity. 
+
+Keywords Deep learning · Partial differential equations · Computational physics · Chaotic systems 
+
+## 1 Introduction 
+
+Physics-informed neural networks (PINNs) have emerged as a promising framework for synthesizing observational data and physical laws across diverse applications in science and engineering [1, 2, 3, 4, 5, 6, 7, 8]. However, it is well known that PINNs often face severe difficulties and even fail to tackle problems whose solution exhibits highly nonlinear, multi-scale, or chaotic behavior [9, 10]. Over the last few years, a series of extensions to the original formulation of Raissi et al. [11] have been proposed with the sole goal of enhancing the accuracy and robustness of PINNs in tackling increasingly more challenging problems. Such extensions include, but are not limited to, novel optimization algorithms for adaptive training [12, 13, 14, 15], adaptive algorithms for selecting batches of training data [16, 17], novel network architectures [12, 9, 18, 19, 20], domain decomposition strategies [21, 22], new types of activation functions [23], and sequential learning strategies [16, 24, 25]. Although these techniques have been successful in introducing some improvements in terms of trainability and accuracy, there still exists a vast suite of problems that remain elusive to PINNs. Examples of such problems include systems whose behavior exhibits strong non-linearity, broadband energy spectra, and high sensitivity to initial conditions, such as the chaotic Kuramoto–Sivashinsky equation and the Navier-Stokes equations in the turbulent regime. These are not pathological corner cases, but cases that are extremely relevant across a multitude of realistic scenarios in science and engineering. Therefore, there is a pressing need for understanding why PINNs fall short in such scenarios, and how they can be improved in order to overcome the challenges that currently limit their success to relatively simple problems.
+
+> arXiv:2203.07404v1 [cs.LG] 14 Mar 2022 (A PREPRINT - MARCH 16, 2022)
+
+Physical systems are known to possess an inherent causal structure. Consider for example a linear wave with some initial velocity that is spreading out with a speed $c$ across a homogeneous medium [26]. It is well understood that, although a part of the wave may lag behind (if there is an initial velocity), no part can travel faster than speed $c$. This assertion encapsulates the so-called principle of causality that dictates how local changes in the initial/boundary data of a spatio-temporal dynamical system are reflected in its corresponding states at later times [26]. Specific to hyperbolic partial differential equations (PDEs), such as the wave equation, this principle underpins the formulation of the method of characteristics [27] that provides a rigorous set of analytical and numerical tools for efficiently tackling initial value problems. Although characterizing how information propagates in general nonlinear PDEs is a challenging task, basic principles of causality such as temporal precedence and covariation (i.e. statistical dependency between variables that are generated by coupled time evolution) are still expected to hold.
+
+This causal structure is also clearly reflected in classical numerical methods, where a PDE is typically discretized in time by sequential algorithms which ensure that the solution at time $t$ is fully resolved before approximating the solution at time $t + \Delta t$. Strikingly, this notion of temporal dependence is absent in most continuous-time PINNs formulations (see e.g. [28, 29, 30, 21, 12, 13, 23]). In fact, as we will see in section 3, continuous-time PINNs trained by gradient descent are implicitly biased towards first approximating PDE solutions at later times, before even resolving the initial conditions, therefore profoundly violating temporal causality. Consequently, it is no surprise that such formulations are fragile and often fail to simulate forward problems, especially in cases where the target solutions exhibit strong dependence on initial data (e.g. chaotic systems). Recent studies [16, 24, 25] have proposed remedies to this issue by empirically introducing sequential training strategies, yet a concrete justification of why such strategies appear to be effective is still missing.
+
+This work is focused on investigating the importance of respecting physical causality during the training of continuous-time PINNs. Our specific contributions can be summarized as:
+
+- We reveal an implicit bias suggesting that continuous-time PINNs models can violate causality, and hence are susceptible to converge towards erroneous solutions.
+- We put forth a simple re-formulation of PINNs loss functions that allows us to explicitly respect the causal structure that characterizes the solution of general nonlinear PDEs.
+- Strikingly, we demonstrate that this simple modification alone is enough to introduce significant accuracy improvements, allowing us to tackle problems that have remained elusive to PINNs.
+- We provide a practical quantitative criterion for assessing the training convergence of a PINNs model.
+- We examine a collection of challenging benchmarks for which existing PINNs formulations fail, and demonstrate that the proposed causal training strategy leads to state-of-the-art results. To the best of our knowledge, this is the first time that PINNs have been successful in simulating systems such as the chaotic Lorenz system, the Kuramoto–Sivashinsky equation in the chaotic regime, and the Navier-Stokes equations in the turbulent regime, introducing new opportunities for their applicability to problems of industrial complexity.
+
+The paper is structured as follows. In section 2, we provide an overview of PINNs following the original formulation of Raissi et al. [11]. Using a simple case study, we reveal an implicit bias of continuous-time PINNs that makes them prone to violating physical causality, and thereby steering them towards erroneous solutions. To address this drawback, in section 3 we put forth a simple re-formulation of the PINNs residual loss and propose a general causal training algorithm for explicitly respecting physical causality during model training. Section 4 discusses practical considerations specific to enhancing the accuracy and efficiency of PINNs. These developments are put to test in section 5, where we demonstrate state-of-the-art results across a comprehensive collection of challenging benchmarks for which existing PINN formulations are known to fail. Finally, section 6 provides a summary of our main findings, and touches upon remaining limitations and areas for future research.
+
+## 2 Physics-informed neural networks (PINNs) 
+
+Problem setup: We begin with a brief overview of physics-informed neural networks (PINNs) [11] in the context of inferring the solutions of PDEs. Generally, we consider PDEs taking the form
+
+$$
+u_t + \mathcal{N}[u] = 0, \quad t \in [0, T], \; x \in \Omega, \tag{2.1}
+$$
+
+subject to the initial and boundary conditions
+
+$$
+u(0, x) = g(x), \quad x \in \Omega, \tag{2.2}
+$$
+
+$$
+\mathcal{B}[u] = 0, \quad t \in [0, T], \; x \in \partial\Omega, \tag{2.3}
+$$
+
+where $\mathcal{N}[\cdot]$ is a linear or nonlinear differential operator, and $\mathcal{B}[\cdot]$ is a boundary operator corresponding to Dirichlet, Neumann, Robin, or periodic boundary conditions. In addition, $u$ describes the unknown latent solution that is governed by the PDE system of Equation 2.1.
+
+Following the original work of Raissi et al. [11], we proceed by representing the unknown solution $u(t, x)$ by a deep neural network $u_\theta(t, x)$, where $\theta$ denotes all tunable parameters of the network (e.g., weights and biases). Then, a physics-informed model can be trained by minimizing the following composite loss function
+
+$$
+\mathcal{L}(\theta) = \lambda_{ic} \mathcal{L}_{ic}(\theta) + \lambda_{bc} \mathcal{L}_{bc}(\theta) + \lambda_r \mathcal{L}_r(\theta), \tag{2.4}
+$$
+
+where
+
+$$
+\mathcal{L}_{ic}(\theta) = \frac{1}{N_{ic}} \sum_{i=1}^{N_{ic}} \left| u_\theta(0, x_{ic}^i) - g(x_{ic}^i) \right|^2, \tag{2.5}
+$$
+
+$$
+\mathcal{L}_{bc}(\theta) = \frac{1}{N_{bc}} \sum_{i=1}^{N_{bc}} \left| \mathcal{B}[u_\theta](t_{bc}^i, x_{bc}^i) \right|^2, \tag{2.6}
+$$
+
+$$
+\mathcal{L}_r(\theta) = \frac{1}{N_r} \sum_{i=1}^{N_r} \left| \frac{\partial u_\theta}{\partial t}(t_r^i, x_r^i) + \mathcal{N}[u_\theta](t_r^i, x_r^i) \right|^2. \tag{2.7}
+$$
+
+Here $\{x_{ic}^i\}_{i=1}^{N_{ic}}$, $\{t_{bc}^i, x_{bc}^i\}_{i=1}^{N_{bc}}$ and $\{t_r^i, x_r^i\}_{i=1}^{N_r}$ can be the vertices of a fixed mesh or points that are randomly sampled at each iteration of a gradient descent algorithm. Notice that all required gradients with respect to input variables or network parameters $\theta$ can be efficiently computed via automatic differentiation [31]. Moreover, the hyper-parameters $\{\lambda_{ic}, \lambda_{bc}, \lambda_r\}$ allow the flexibility of assigning a different learning rate to each individual loss term in order to balance their interplay during model training. These weights may be user-specified or tuned automatically during training [12, 13].
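+
+As a concrete illustration (an assumed example, not taken from the paper), consider the heat equation on $\Omega = (-1, 1)$, that is $\mathcal{N}[u] = -\nu u_{xx}$ with an assumed diffusivity $\nu > 0$, homogeneous Dirichlet conditions $\mathcal{B}[u] = u$, and initial data $g(x) = \sin(\pi x)$. Under these assumptions the three terms of the composite loss in Equation 2.4 read
+
+$$
+\begin{aligned}
+\mathcal{L}_{ic}(\theta) &= \frac{1}{N_{ic}} \sum_{i=1}^{N_{ic}} \left| u_\theta(0, x_{ic}^i) - \sin(\pi x_{ic}^i) \right|^2, \\
+\mathcal{L}_{bc}(\theta) &= \frac{1}{N_{bc}} \sum_{i=1}^{N_{bc}} \left| u_\theta(t_{bc}^i, x_{bc}^i) \right|^2, \quad x_{bc}^i \in \{-1, 1\}, \\
+\mathcal{L}_r(\theta) &= \frac{1}{N_r} \sum_{i=1}^{N_r} \left| \frac{\partial u_\theta}{\partial t}(t_r^i, x_r^i) - \nu \frac{\partial^2 u_\theta}{\partial x^2}(t_r^i, x_r^i) \right|^2.
+\end{aligned}
+$$
+
+For this assumed problem the exact solution $u(t, x) = e^{-\nu \pi^2 t} \sin(\pi x)$ satisfies $u_t = -\nu \pi^2 u = \nu u_{xx}$, $u(t, \pm 1) = 0$ and $u(0, x) = \sin(\pi x)$, so all three terms vanish and $\mathcal{L} = 0$ for any positive weights. The weights therefore do not move the global minimum; they only change how the terms trade off against each other away from it.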
+
+An illustrative example: To motivate the proposed methods described in section 3, let us study a representative case with which conventional PINN models are known to struggle. To this end, consider the one-dimensional Allen-Cahn equation
+
+$$
+u_t - 0.0001\, u_{xx} + 5u^3 - 5u = 0, \quad t \in [0, 1], \; x \in [-1, 1], \tag{2.8}
+$$
+
+$$
+u(0, x) = x^2 \cos(\pi x), \tag{2.9}
+$$
+
+$$
+u(t, -1) = u(t, 1), \tag{2.10}
+$$
+
+$$
+u_x(t, -1) = u_x(t, 1). \tag{2.11}
+$$
+
+This example is difficult to directly solve with the original continuous-time formulation of Raissi et al. [11], and has been recently studied by Wight et al. [16] and McClenny et al. [14] who developed adaptive re-sampling and weighting algorithms, respectively, to improve the PINNs prediction. Following the setup discussed in these studies [14, 16], we represent the latent variable $u$ by a fully-connected neural network $u_\theta$ with tanh activation function, 4 hidden layers and 128 neurons per hidden layer. To further simplify the training objective 2.4, we also strictly impose the periodic BCs by embedding the input coordinates into a Fourier expansion using Equation 4.8 with $m = 10$ (see section 4 for further details). Then the loss function 2.4 can be reduced to
+
+$$
+\mathcal{L}(\theta) = \lambda_{ic} \mathcal{L}_{ic}(\theta) + \lambda_r \mathcal{L}_r(\theta), \tag{2.12}
+$$
+
+where $\mathcal{L}_{ic}(\theta)$ and $\mathcal{L}_r(\theta)$ are defined exactly the same as in Equation 2.5 and Equation 2.7. For simplicity, we create a uniform mesh of size $100 \times 256$ in the computational domain $[0, 1] \times [-1, 1]$, yielding $N_{ic} = 256$ initial points and $N_r = 25600$ collocation points for enforcing the PDE residual. We also choose $\lambda_{ic} = 100$, $\lambda_r = 1$ for better enforcing the initial condition. We proceed by training the resulting PINN model via full-batch gradient descent using the Adam optimizer [32] for $2 \times 10^5$ iterations.
+
+As shown in Figure 1, even when the periodic boundary conditions are enforced exactly, our conventional PINN model is unable to learn the accurate solution for this example. One can also observe that the predicted solution seems to get stuck at some intermediate state and cannot be further refined to provide an accurate approximation to the ground truth. This is consistent with the left panel of Figure 2 where the loss functions rapidly decrease in the first few thousand training iterations, and then barely change for the rest of training, implying that the neural network gets trapped in an erroneous local minimum. Unfortunately, such problematic behavior is not a rare event, but rather a common outcome for PINNs, especially when solving transient problems [13, 24].
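+
+The numbers in this setup can be checked with a few lines of arithmetic (an illustrative aside, not part of the original text). The mesh and weights give
+
+$$
+N_{ic} = 256, \qquad N_r = 100 \times 256 = 25600, \qquad \mathcal{L}(\theta) = 100\, \mathcal{L}_{ic}(\theta) + \mathcal{L}_r(\theta).
+$$
+
+The reaction term of Equation 2.8 factors as $5u^3 - 5u = 5u(u - 1)(u + 1)$, so the constant states $u \in \{-1, 0, 1\}$ solve the PDE exactly. Writing the spatially homogeneous dynamics as $u_t = f(u)$ with $f(u) = 5u - 5u^3$ gives
+
+$$
+f'(u) = 5 - 15u^2, \qquad f'(0) = 5 > 0, \qquad f'(\pm 1) = -10 < 0,
+$$
+
+so $u = 0$ is unstable and $u = \pm 1$ are stable. Balancing diffusion against reaction yields a length scale of order $\sqrt{10^{-4} / 5} \approx 4.5 \times 10^{-3}$. This is only a rough scaling estimate, but it suggests that the solution develops thin transition layers between regions close to $\pm 1$, which is one plausible reason the problem is hard for a smooth network to fit.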
+
+PINNs can violate physical causality: To explore the underlying reasons behind this failed case study, let us closely examine the definition of the residual loss $\mathcal{L}_r$. Before doing so, we will slightly change our notation for convenience. Suppose that $0 = t_1 < t_2 < \cdots < t_{N_t} = T$ discretizes the temporal domain, and $\{x_j\}_{j=1}^{N_x}$ discretizes the spatial domain $\Omega$. For this example, $\{t_i\}_{i=1}^{N_t}$ and $\{x_j\}_{j=1}^{N_x}$ are uniformly spaced meshes in $[0, 1]$ and $[-1, 1]$, respectively. Now for a given spatial discretization $\{x_j\}_{j=1}^{N_x}$, we define the temporal residual loss as
+
+$$
+\mathcal{L}_r(t, \theta) = \frac{1}{N_x} \sum_{j=1}^{N_x} \left| \frac{\partial u_\theta}{\partial t}(t, x_j) + \mathcal{N}[u_\theta](t, x_j) \right|^2. \tag{2.13}
+$$
+
+Then, the residual loss 2.7 can be rewritten as
+
+$$
+\mathcal{L}_r(\theta) = \frac{1}{N_t} \sum_{i=1}^{N_t} \mathcal{L}_r(t_i, \theta) \tag{2.14}
+$$
+
+$$
+= \frac{1}{N_t N_x} \sum_{i=1}^{N_t} \sum_{j=1}^{N_x} \left| \frac{\partial u_\theta}{\partial t}(t_i, x_j) + \mathcal{N}[u_\theta](t_i, x_j) \right|^2. \tag{2.15}
+$$
+
+Next, we discretize $\frac{\partial u_\theta}{\partial t}$ using the forward Euler scheme [33]. For any $1 \le i \le N_t - 1$, $\mathcal{L}_r(t_i, \theta)$ can be approximated by
+
+$$
+\mathcal{L}_r(t_i, \theta) \approx \frac{1}{N_x} \sum_{j=1}^{N_x} \left| \frac{u_\theta(t_i, x_j) - u_\theta(t_{i-1}, x_j)}{\Delta t} + \mathcal{N}[u_\theta](t_i, x_j) \right|^2 \approx \frac{|\Omega|}{\Delta t^2} \int_\Omega \left| u_\theta(t_i, x) - u_\theta(t_{i-1}, x) + \Delta t\, \mathcal{N}[u_\theta](t_i, x) \right|^2 dx. \tag{2.16}
+$$
+
+From the above expression, we immediately obtain that the minimization of $\mathcal{L}_r(t_i, \theta)$ should be based on the correct prediction of both $u_\theta(t_i, x)$ and $u_\theta(t_{i-1}, x)$, while the original formulation of Equation 2.7 tends to minimize all $\mathcal{L}_r(t_i, \theta)$ simultaneously. As a result, by using Equation 2.7, the residual loss $\mathcal{L}_r(t_i, \theta)$ will be minimized even if the predictions at $t_i$ and previous times are inaccurate. This behavior inevitably violates temporal causality, making the PINN model susceptible to learning erroneous solutions. This conclusion is further confirmed by the middle panel of Figure 2 where we plot the temporal residual loss of the Allen-Cahn equation at different iterations of training. As expected, the residual is quite large near the initial state and rapidly decays to nearly zero after $t = 0.5$. We emphasize that a PDE temporal residual of small magnitude is meaningful only if the PINN model is well optimized and able to yield accurate predictions at the previous time steps.
+
+An undesirable implicit bias: To provide a deeper understanding of the fact that PINNs may violate temporal causality, we analyze their training dynamics through the lens of their empirical Neural Tangent Kernel (NTK) [34, 13]. Specifically, for every $\mathcal{L}_r(t, \theta)$ (Equation 2.13), we can define the empirical NTK $K_\theta(t) \in \mathbb{R}^{N_x \times N_x}$ whose $ij$-th entry is given by [13]
+
+$$K_\theta(t)_{ij} = \left\langle \frac{\partial R_\theta}{\partial \theta}(t, x_i), \frac{\partial R_\theta}{\partial \theta}(t, x_j) \right\rangle, \tag{2.17}$$
+
+where $R_\theta$ is the corresponding PDE residual defined by
+
+$$R_\theta(t, x) = \frac{\partial u_\theta}{\partial t}(t, x) + \mathcal{N}[u_\theta](t, x), \quad i, j = 1, 2, \dots, N_x. \tag{2.18}$$
+
+Figure 1: Allen-Cahn equation: Top: Reference solution versus the prediction of a trained conventional physics-informed neural network. The resulting relative $L^2$ error is 49.87%. Bottom: Comparison of the predicted and reference solutions corresponding to the three temporal snapshots at $t = 0.0, 0.5, 1.0$.
+
+Figure 2: Allen-Cahn equation: Left: Loss convergence ($\mathcal{L}_{ic}$ and $\mathcal{L}_r$) of training a conventional physics-informed neural network for $2 \times 10^5$ iterations. Middle: Temporal residual loss $\mathcal{L}(t, \theta)$ at different iterations of the training. Right: Temporal convergence rate $C(t)$ at different iterations of the training.
+
+As demonstrated by Wang et al. [13], the eigenvalues of $K_\theta(t)$ determine the convergence rate of each $\mathcal{L}_r(t, \theta)$ contributing to the total residual loss $\mathcal{L}_r(\theta)$. Specifically, larger eigenvalues imply a faster convergence rate. Following [13], we introduce the definition
+
+Definition 2.1. For any given $t \in [0, T]$, the temporal convergence rate $C(t)$ of $\mathcal{L}_r(t, \theta)$ is defined by
+
+$$C(t) = \frac{\sum_{k=1}^{N_t} \lambda_k(t)}{N_t} = \frac{\mathrm{Trace}(K_\theta(t))}{N_t}, \tag{2.19}$$
+
+where $\{\lambda_k(t)\}_{k=1}^{N_t}$ are the eigenvalues of $K_\theta(t)$.
+
+Equipped with the definition in Equation 2.19, we visualize $C(t)$ at different iterations during the training of our PINNs model for solving the Allen-Cahn equation. In the right panel of Figure 2, it can be seen that $C(t)$ increases with $t$, indicating that the network is biased towards minimizing the temporal residual $\mathcal{L}_r(t, \theta)$ for larger $t$. This reveals an undesirable implicit bias of continuous-time PINN models trained via gradient descent, suggesting that such models can profoundly violate the temporal causal structure that is inherent to time-dependent PDE systems. We argue that this inherent pathology of PINNs is the key underlying reason behind their inability to simulate transient problems that exhibit strong temporal correlations and sensitivity to initial data. In the next section we put forth a remarkably simple and effective strategy for explicitly respecting physical causality during the training phase of PINNs.
+
+| Method | Relative $L^2$ error |
+| --- | --- |
+| Original formulation of Raissi et al. [11] | 4.98e−01 |
+| Adaptive time sampling [16] | 2.33e−02 |
+| Self-attention [14] | 2.10e−02 |
+| Time marching [25] | 1.68e−02 |
+| Causal training (MLP) | 1.43e−03 |
+| Causal training (modified MLP) | 1.39e−04 |
+
+Table 1: Allen-Cahn equation: Relative $L^2$ errors obtained by different approaches.
+
+## 3 Causal training for physics-informed neural networks 
+
+A simple re-formulation: Based on our findings in the previous section, it is natural to ask how we can respect physical causality when solving PDEs with PINNs. We answer this question by introducing a simple re-formulation of the PINNs training objective that can explicitly account for the missing causal structure. To this end, we define a weighted residual loss as
+
+$$\mathcal{L}_r(\theta) = \frac{1}{N_t} \sum_{i=1}^{N_t} w_i \mathcal{L}_r(t_i, \theta). \tag{3.1}$$
+
+We recognize that the weights $w_i$ should be large, and therefore allow the minimization of $\mathcal{L}_r(t_i, \theta)$, only if all residuals $\{\mathcal{L}_r(t_k, \theta)\}_{k=1}^{i-1}$ before $t_i$ are minimized properly, and vice versa. This can be achieved by expressing the weights $w_i$ as
+
+$$w_i = \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}_r(t_k, \theta) \right), \quad \text{for } i = 2, 3, \dots, N_t, \tag{3.2}$$
+
+where $\epsilon$ will be referred to as a causality parameter that controls the steepness of the weights $w_i$ (see below for a more detailed discussion). As such, the weighted residual loss can be written as
+
+$$\mathcal{L}_r(\theta) = \frac{1}{N_t} \sum_{i=1}^{N_t} \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}_r(t_k, \theta) \right) \mathcal{L}_r(t_i, \theta). \tag{3.3}$$
+
+Notice that $w_i$ decays exponentially with the magnitude of the cumulative residual loss from the previous time steps. As a consequence, $\mathcal{L}_r(t_i, \theta)$ will not be minimized unless all previous residuals $\{\mathcal{L}_r(t_k, \theta)\}_{k=1}^{i-1}$ decrease to some small value such that $w_i$ is large enough.
+
+We now employ this simple modification and revisit the Allen-Cahn case study discussed before. We proceed by training the same network by minimizing the loss of Equation 2.4 using the weighted residual loss of Equation 3.3 with $\epsilon = 100$, for $3 \times 10^5$ iterations of gradient descent under exactly the same hyper-parameter settings. The results of this experiment are summarized in Figure 3. One can see that the predicted solution achieves an excellent agreement with the ground truth, yielding an approximation error of 1.43e−03 measured in the relative $L^2$ norm. The left panel of Figure 4 presents the convergence of the different loss function components, which is evidently much better than the one presented in Figure 2. Here we note that no other modifications between the two cases exist, besides the use of the proposed weighted residual loss of Equation 3.3. In fact, if in conjunction with the weighted residual loss we also employ a more powerful architecture for this example, such as the modified MLP [12] described in section 4, then we can achieve an even more accurate result with a resulting relative $L^2$ error of 1.39e−04. Additional detailed visual assessments for this example are provided in Appendix D. Finally, in Table 1 we provide the accuracy reported for this problem by existing approaches in the literature [14, 16, 25]. It is evident that the proposed methodology outperforms the best reported result of competing approaches by a factor of ∼10-100x. This is a strong indication of the significance and necessity of respecting causality in training PINNs.
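+
+As a minimal sketch of Equations 3.2 and 3.3 (not the authors' released code), the weights can be computed from a vector of temporal residual losses with an exclusive cumulative sum. The function names `causal_weights` and `weighted_residual_loss` and the toy loss values are assumptions made for illustration; `jax.lax.stop_gradient` is used so that the weights act as constants during back-propagation.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def causal_weights(temporal_losses, eps):
+    """Weights w_i = exp(-eps * sum_{k<i} L_r(t_k)), following Equation 3.2.
+
+    temporal_losses has shape (N_t,) and is ordered by increasing time.
+    """
+    # Exclusive cumulative sum: entry i holds the sum of all earlier losses,
+    # so the first weight is exp(0) = 1.
+    earlier = jnp.concatenate([jnp.zeros(1), jnp.cumsum(temporal_losses[:-1])])
+    # Treat the weights as constants during back-propagation.
+    return jax.lax.stop_gradient(jnp.exp(-eps * earlier))
+
+
+def weighted_residual_loss(temporal_losses, eps):
+    """Weighted residual loss of Equation 3.3."""
+    w = causal_weights(temporal_losses, eps)
+    return jnp.mean(w * temporal_losses)
+
+
+# Toy values (hypothetical): the residual at the second time step is large.
+losses = jnp.array([1e-4, 5e-1, 1e-3, 1e-3])
+weights = causal_weights(losses, eps=100.0)
+loss = weighted_residual_loss(losses, eps=100.0)
+```
+
+With these toy values the large residual at the second time step drives the third and fourth weights to nearly zero, so the later residuals contribute almost nothing to the loss until the earlier one has been reduced.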
+
+A stopping criterion for assessing training convergence: To understand the effect of the residual weights $\{w_i\}$, we present the temporal residual loss and weights at different iterations of gradient descent in the middle and right panels of Figure 4 and in Figure 16. We observe that the initial temporal weights are all zero except for $t = 0$, implying that only $\mathcal{L}_r(t_0, \theta)$ will be minimized at the beginning of training. Throughout the rest of the training, more temporal weights are activated, and eventually all of them converge to 1 as the PDE residual loss is properly minimized. This last observation suggests that monitoring the magnitude of the residual weights $\{w_i\}$ can provide an effective stopping criterion for assessing the convergence of a PINNs model during training. Specifically, one can choose to terminate training if $\min_i w_i > \delta$, for some chosen threshold parameter $\delta \in (0, 1)$. As we will see in section 5, this stopping criterion not only helps to train a PINNs model faster, but it actually yields trained models with superior predictive accuracy.
+
+Figure 3: Allen-Cahn equation: Top: Reference solution versus the prediction of a trained physics-informed neural network using Algorithm 1. The resulting relative $L^2$ error is 1.43e−03. Bottom: Comparison of the predicted and reference solutions corresponding to the three temporal snapshots at $t = 0.0, 0.5, 1.0$.
+
+Figure 4: Allen-Cahn equation: Left: Loss convergence ($\mathcal{L}_{ic}$ and $\mathcal{L}_r$) of training a physics-informed neural network using Algorithm 1. Middle: Temporal residual loss $\mathcal{L}(t, \theta)$ at different iterations of the training. Right: Temporal weights $w$ at different iterations of the training.
+
+Sensitivity to the causality parameter $\epsilon$: Here we must note that the results obtained using the proposed weighted residual loss do exhibit some sensitivity to the causality parameter $\epsilon$ in Equation 3.2. Choosing a very small $\epsilon$ can prevent the network from effectively minimizing the later temporal residuals. On the other hand, choosing a large $\epsilon$ value can result in a more difficult optimization problem, because the temporal residuals at earlier times have to decrease to a very small value in order to activate the later temporal weights. This may be hard to achieve in some cases due to limited network capacity in minimizing the target residuals. In order to avoid tedious hyper-parameter tuning, we employ an annealing strategy for adjusting $\epsilon$ using an increasing sequence of values $\{\epsilon_i\}_{i=1}^{k}$, which gradually increases the strength with which the PDE residual constraint is enforced. As we will see in section 5, we empirically observe that this choice yields the best results in practice.
+
+Fitting the initial data: In the spirit of respecting causality, one may recognize that all temporal residuals should be minimized only if the network can first accurately fit the initial data. Therefore, we may treat the initial loss $\mathcal{L}_{ic}$ as a special temporal residual at $t = 0$ and incorporate it into the weighted residual loss of Equation 3.1 in the same manner.
+
+Causal training for PINNs: Based on the above remarks, Algorithm 1 presents a general causal training algorithm for PINNs. Specifically, it summarizes the proposed re-formulation of the residual and initial conditions loss, the annealing scheme for the $\epsilon$ parameter, and the stopping criterion for terminating the training upon the convergence of the temporal weights $w_i$. Accompanying Algorithm 1, here we present a few additional remarks worth discussing.
+
+1. Although in this work we have limited our attention to PDEs with periodic boundary conditions that can be enforced in an exact manner (see section 4 for more details), the proposed causal training algorithm can be adapted to also incorporate boundary constraints using a similar treatment to the initial conditions loss.
+2. Note that the temporal weights $\{w_i\}_{i=0}^{N_t}$ are a function of the trainable parameters $\theta$. We use `lax.stop_gradient` in our JAX [35] implementation to prevent gradient back-propagation through the computation of $w_i$.
+3. The computational cost of the proposed algorithm is negligible compared to conventional PINNs formulations since the weights $w_i$ are computed by directly evaluating the PINNs loss functions, whose values are already stored in the computational graph during training.
+4. The proposed algorithm is not limited to fixed mesh points for evaluating the PINNs loss terms, and the collocation points can be randomly sampled at each iteration of gradient descent. The only requirement is that the sampled temporal points $\{t_i\}_{i=1}^{N_t}$ should form a non-decreasing sequence in the temporal domain so that temporal causality can be respected.
+
+Here we should also mention that Algorithm 1 is general and can be employed within any existing physics-informed machine learning pipeline, including physics-informed neural networks [11, 36, 30, 19, 21, 37], physics-informed deep operator networks [38, 39, 40], and physics-informed neural operators [41].
+
+Algorithm 1: Causal training for physics-informed neural networks
+
+Consider a physics-informed neural network $u_\theta(t, x)$ that satisfies the boundary conditions exactly, and the corresponding weighted loss function
+
+$$\mathcal{L}(\theta) = \frac{1}{N_t} \sum_{i=0}^{N_t} w_i \mathcal{L}(t_i, \theta), \tag{3.4}$$
+
+where $\mathcal{L}(t_0, \theta) = \lambda_{ic} \mathcal{L}_{ic}(\theta)$ and, for $1 \le i \le N_t$, $\mathcal{L}(t_i, \theta)$ is defined in Equation 2.13. All $w_i$ are initialized to 1. Then use $S$ steps of a gradient descent algorithm to update the parameters $\theta$ as:
+
+- for $\epsilon = \epsilon_1, \dots, \epsilon_k$ do
+  - for $n = 1, \dots, S$ do
+    - (a) Compute and update the temporal weights by $w_i = \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}(t_k, \theta) \right)$, for $i = 2, 3, \dots, N_t$. (3.5) Here $\epsilon > 0$ is a user-defined hyper-parameter that determines the "slope" of the temporal weights.
+    - (b) Update the parameters $\theta$ via gradient descent: $\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n)$. (3.6)
+    - if $\min_i w_i > \delta$ then break
+  - end
+- end
+
+The recommended hyper-parameters are $\lambda_{ic} = 10^3$, $\delta = 0.99$ and $\{\epsilon_i\}_{i=1}^{k} = [10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}]$.
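+
+The structure of Algorithm 1 can be sketched on a toy problem as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the ODE $du/dt = -u$ with $u(0) = 1$, the one-hidden-layer network `u`, the learning rate, the step count `S`, and all function names are hypothetical choices. The initial-condition loss is included in the cumulative sum as the residual at index 0, and the mean over $N_t + 1$ terms differs from Equation 3.4 only by a constant factor.
+
+```python
+import jax
+import jax.numpy as jnp
+
+# Toy problem (an assumption for illustration): du/dt = -u, u(0) = 1 on [0, 1].
+ts = jnp.linspace(0.0, 1.0, 32)
+
+
+def u(params, t):
+    """Tiny one-hidden-layer surrogate u_theta(t) returning a scalar."""
+    w1, b1, w2, b2 = params
+    return jnp.dot(jnp.tanh(w1 * t + b1), w2) + b2
+
+
+def temporal_losses(params, lam_ic=1e3):
+    """Stack lam_ic * L_ic (index 0) with the squared residual at each t_i."""
+    du_dt = jax.vmap(jax.grad(u, argnums=1), in_axes=(None, 0))(params, ts)
+    u_vals = jax.vmap(u, in_axes=(None, 0))(params, ts)
+    residual = (du_dt + u_vals) ** 2
+    ic = lam_ic * (u(params, 0.0) - 1.0) ** 2
+    return jnp.concatenate([ic[None], residual])
+
+
+def causal_loss(params, eps):
+    losses = temporal_losses(params)
+    # Exclusive cumulative sum of earlier losses; weights are held constant.
+    earlier = jnp.concatenate([jnp.zeros(1), jnp.cumsum(losses[:-1])])
+    w = jax.lax.stop_gradient(jnp.exp(-eps * earlier))
+    return jnp.mean(w * losses), w
+
+
+@jax.jit
+def step(params, eps, lr):
+    (loss, w), grads = jax.value_and_grad(causal_loss, has_aux=True)(params, eps)
+    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
+    return params, loss, w
+
+
+def train(params, eps_schedule=(1e-2, 1e-1, 1e0, 1e1, 1e2), S=200, lr=1e-3, delta=0.99):
+    for eps in eps_schedule:          # annealing over the causality parameter
+        for _ in range(S):
+            params, loss, w = step(params, eps, lr)
+            if jnp.min(w) > delta:    # stopping criterion for the current eps
+                break
+    return params, loss, w
+
+
+k1, k2 = jax.random.split(jax.random.PRNGKey(0))
+hidden = 16
+params = (jax.random.normal(k1, (hidden,)), jnp.zeros(hidden),
+          0.1 * jax.random.normal(k2, (hidden,)), jnp.zeros(()))
+params, loss, w = train(params)
+```
+
+Plain gradient descent is used here to mirror Equation 3.6; the experiments in section 5 use the Adam optimizer instead.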
+
+Connection to existing approaches: It is worth pointing out that the proposed residual weighting strategy bears some similarity to the adaptive time sampling of Wight et al. [16], since the effect of the weights $w_i$ can be viewed as equivalent to changing the sampling density of collocation points. However, the method of Wight et al. has two main disadvantages in practice: a) the sampling density has to be manually designed for different problems and training iterations, and b) an accurate approximation of the designed sampling density requires a large volume of collocation points, leading to a large computational cost. Besides, we remark that our method shares the same motivation with "time-marching" or "curriculum training" strategies [16, 24, 42, 43], in the sense of respecting temporal causality by learning the solution sequentially within separate time-windows. In fact, our causal training strategy should not be viewed as a replacement of time-marching approaches, but instead as a crucial enhancement to those, given the fact that violations of causality may still occur within each time window of a time-marching algorithm.
+
+## 4 Practical considerations 
+
+As we will see in section 5, high-order accuracy becomes a necessity for PINNs in order to tackle problems exhibiting sensitivity to initial data and strong spatio-temporal correlations (e.g. chaotic systems). Although PINNs are known for being incapable of achieving high-order accuracy in general, in this section we highlight a few extensions that can further enhance their performance in more challenging settings. Although these features are not deemed crucial for the successful application of Algorithm 1, we have empirically observed that, for the problems considered in this work, they can lead to further enhancements in terms of accuracy and computational efficiency.
+
+Modified multi-layer perceptrons: In [12] Wang et al. put forth a novel architecture that was demonstrated to outperform conventional MLPs across a variety of PINNs benchmarks. Here, we will refer to this architecture as "modified MLP". The forward pass of an $L$-layer modified MLP is defined as follows
+
+$$U = \sigma(X W_1 + b_1), \quad V = \sigma(X W_2 + b_2), \tag{4.1}$$
+
+$$H^{(1)} = \sigma(X W^{(l)} + b^{(l)}), \tag{4.2}$$
+
+$$Z^{(l)} = \sigma(H^{(l)} W^{(l)} + b^{(l)}), \quad l = 1, \dots, L - 1, \tag{4.3}$$
+
+$$H^{(l+1)} = (1 - Z^{(l)}) \odot U + Z^{(l)} \odot V, \quad l = 1, \dots, L - 1, \tag{4.4}$$
+
+$$u_\theta(X) = H^{(L)} W^{(L)} + b^{(L)}, \tag{4.5}$$
+
+where $\sigma$ denotes a nonlinear activation function, $\odot$ denotes a point-wise multiplication, and $X$ denotes a batch of input coordinates. All trainable parameters are given by
+
+$$\theta = \{W_1, b_1, W_2, b_2, (W^{(l)}, b^{(l)})_{l=1}^{L}\}. \tag{4.6}$$
+
+At first glance, this architecture appears somewhat complicated. However, notice that it is almost the same as a standard MLP network, with the addition of two encoders and a minor modification in the forward pass. Specifically, the inputs $X$ are embedded into a feature space via two encoders $U$, $V$, respectively, and merged in each hidden layer of a standard MLP using a point-wise multiplication. Based on our prior experience, the modified MLP architecture is shown to be more powerful than standard MLPs in terms of minimizing the PDE residuals and capturing sharp gradients [12, 9, 38, 39].
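+
+A forward pass in the spirit of Equations 4.1 to 4.5 might look as follows in JAX. This is a hedged sketch, not the reference implementation: the parameter layout (a dict with keys `enc_u`, `enc_v`, `hidden`, `out`), the helper names, the use of a separate input layer for $H^{(1)}$, and the Tanh activation with Glorot-normal initialization are assumptions chosen for illustration.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def init_modified_mlp(key, in_dim, hidden_dim, out_dim, num_hidden):
+    """Glorot-normal weights and zero biases for the two encoders and the trunk."""
+    def dense(k, d_in, d_out):
+        std = jnp.sqrt(2.0 / (d_in + d_out))
+        return std * jax.random.normal(k, (d_in, d_out)), jnp.zeros(d_out)
+
+    keys = jax.random.split(key, num_hidden + 3)
+    return {
+        "enc_u": dense(keys[0], in_dim, hidden_dim),
+        "enc_v": dense(keys[1], in_dim, hidden_dim),
+        # First trunk layer maps the inputs; the rest map hidden to hidden.
+        "hidden": [dense(keys[2], in_dim, hidden_dim)]
+        + [dense(k, hidden_dim, hidden_dim) for k in keys[3:-1]],
+        "out": dense(keys[-1], hidden_dim, out_dim),
+    }
+
+
+def modified_mlp(params, X):
+    """Forward pass: gate between the encoders U and V in every hidden layer."""
+    U = jnp.tanh(X @ params["enc_u"][0] + params["enc_u"][1])
+    V = jnp.tanh(X @ params["enc_v"][0] + params["enc_v"][1])
+    W, b = params["hidden"][0]
+    H = jnp.tanh(X @ W + b)
+    for W, b in params["hidden"][1:]:
+        Z = jnp.tanh(H @ W + b)
+        H = (1.0 - Z) * U + Z * V   # point-wise merge of the two encoders
+    W, b = params["out"]
+    return H @ W + b
+
+
+params = init_modified_mlp(jax.random.PRNGKey(0), in_dim=2, hidden_dim=16,
+                           out_dim=1, num_hidden=3)
+out = modified_mlp(params, jnp.ones((8, 2)))   # batch of 8 (t, x) points
+```
+
+Compared with a standard MLP, the only additions are the two encoder layers and the gated update of `H` inside the loop.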
+
+Exact periodic boundary conditions: Recent work by Dong et al. [44] showed how one can strictly impose periodic boundary conditions in PINNs as hard constraints. We have empirically observed that this trick can simplify the training of PINNs and introduce some savings in terms of computational cost. To illustrate the main idea, let us consider enforcing periodic boundary conditions with period $P$ in a one-dimensional setting. To this end, we would like to make sure that a neural network returns periodic predictions as
+
+$$u^{(l)}(a) = u^{(l)}(a + P), \quad l = 0, 1, 2, \dots. \tag{4.7}$$
+
+To enforce this constraint as part of the architecture itself, we construct a Fourier feature embedding of the form
+
+$$v(x) = (1, \cos(\omega x), \sin(\omega x), \cos(2\omega x), \sin(2\omega x), \cdots, \cos(m\omega x), \sin(m\omega x)), \tag{4.8}$$
+
+with $\omega = \frac{2\pi}{P}$, and some non-negative integer $m$. Then, for any network representation $u_\theta$, it can be proved that any $u_\theta(v(x))$ exactly satisfies the periodic constraint of Equation 4.7 (see [44] for a proof).
+
+The same idea can be extended to higher-dimensional domains. For instance, let $(x, y)$ denote the coordinates of a point in two dimensions, and suppose that $u(x, y)$ is a smooth periodic function to be approximated in a periodic cell $[a, a + P_x] \times [b, b + P_y]$, satisfying the following constraints
+
+$$\frac{\partial^l}{\partial x^l} u(a, y) = \frac{\partial^l}{\partial x^l} u(a + P_x, y), \quad y \in [b, b + P_y], \tag{4.9}$$
+
+$$\frac{\partial^l}{\partial y^l} u(x, b) = \frac{\partial^l}{\partial y^l} u(x, b + P_y), \quad x \in [a, a + P_x], \tag{4.10}$$
+
+for $l = 0, 1, 2, \dots$, where $P_x$ and $P_y$ are the periods in the $x$ and $y$ directions, respectively. Similar to the one-dimensional setting, these constraints can be implicitly encoded in a neural network by constructing a two-dimensional Fourier features embedding as
+
+$$v(x, y) = \begin{bmatrix} \cos(\omega_x x)\cos(\omega_y y), \dots, \cos(n\omega_x x)\cos(m\omega_y y) \\ \cos(\omega_x x)\sin(\omega_y y), \dots, \cos(n\omega_x x)\sin(m\omega_y y) \\ \sin(\omega_x x)\cos(\omega_y y), \dots, \sin(n\omega_x x)\cos(m\omega_y y) \\ \sin(\omega_x x)\sin(\omega_y y), \dots, \sin(n\omega_x x)\sin(m\omega_y y) \end{bmatrix} \tag{4.11}$$
+
+with $\omega_x = \frac{2\pi}{P_x}$, $\omega_y = \frac{2\pi}{P_y}$ and $m, n$ being some non-negative integers. Following [44], any network representation $u_\theta(v(x, y))$ is guaranteed to be periodic in the $x, y$ directions. For time-dependent problems, we simply concatenate the time coordinates $t$ with the constructed Fourier features embedding, i.e., $u_\theta([t, v(x)])$ or $u_\theta([t, v(x, y)])$. Although in this work we will only consider periodic problems, other types of boundary conditions, including Dirichlet, Neumann, Robin, etc., can also be enforced in a "hard" manner, see [45, 46] for more details.
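+
+The one-dimensional embedding of Equation 4.8 and its concatenation with time can be sketched in JAX as below. The function names `periodic_embedding` and `embed_inputs` and the example values (period 2, $m = 3$) are assumptions for illustration; the frequency is taken as $2\pi$ divided by the period.
+
+```python
+import jax.numpy as jnp
+
+
+def periodic_embedding(x, period, m):
+    """Features (1, cos(wx), sin(wx), ..., cos(m*wx), sin(m*wx)) with w = 2*pi/period."""
+    x = jnp.asarray(x)
+    omega = 2.0 * jnp.pi / period
+    k = jnp.arange(1, m + 1)
+    phases = omega * x[..., None] * k                     # shape (..., m)
+    # Interleave cos and sin so the order matches Equation 4.8.
+    cs = jnp.stack([jnp.cos(phases), jnp.sin(phases)], axis=-1)
+    cs = cs.reshape(x.shape + (2 * m,))
+    return jnp.concatenate([jnp.ones(x.shape + (1,)), cs], axis=-1)
+
+
+def embed_inputs(t, x, period, m):
+    """Network input [t, v(x)] for time-dependent problems."""
+    t = jnp.asarray(t)
+    return jnp.concatenate([t[..., None], periodic_embedding(x, period, m)], axis=-1)
+
+
+x = jnp.linspace(-1.0, 1.0, 5)
+v = periodic_embedding(x, period=2.0, m=3)                # shape (5, 7)
+v_shifted = periodic_embedding(x + 2.0, period=2.0, m=3)  # one period later
+inputs = embed_inputs(jnp.zeros(5), x, period=2.0, m=3)   # shape (5, 8)
+```
+
+Because every feature is periodic in `x`, any network applied to `v` inherits the periodicity; `v` and `v_shifted` agree up to floating-point error.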
+
+Taylor-mode automatic differentiation for high-order derivatives: Conventional forward- or reverse-mode automatic differentiation is known to incur a cost that scales exponentially, both in terms of memory and computation, with the order of differentiation. This can quickly introduce a bottleneck in cases where derivatives of order higher than two are required (see e.g. the Kuramoto-Sivashinsky benchmark considered in section 5). To address this drawback, here we employ Taylor-mode automatic differentiation [31] in order to accelerate the computation of high-order derivatives. This is accomplished by leveraging a truncated Taylor polynomial approximation that allows for efficient computation of high-order derivatives of function compositions via the Faà di Bruno formula [31]
+
+$$\frac{\partial^n}{\partial x_1 \cdots \partial x_n} f(g(x)) = \sum_{\sigma \in \pi\{1, \dots, n\}} f^{(|\sigma|)}(g(x)) \prod_{b \in \sigma} \frac{\partial^{|b|}}{\prod_{j \in b} \partial x_j} g(x), \tag{4.12}$$
+
+where $\pi\{1, \dots, n\}$ is the set of all partitions of the set $\{1, \dots, n\}$. It has been shown that Taylor-mode automatic differentiation enjoys much better scaling than conventional forward-mode or reverse-mode automatic differentiation, with its benefits becoming increasingly more dramatic as the order of differentiation is increased [47]. In terms of implementation, we leverage the jax.jet primitive accompanying the work of Bettencourt et al. [47, 35].
+
+Parallel Training: Graphics processing units (GPUs) are the prevailing hardware choice for training PINNs; however, these devices are often bound by their memory capacity. For more complex simulation scenarios (e.g. the Navier-Stokes benchmark in section 5) we have empirically observed that using larger batch sizes during training leads to enhanced convergence and predictive accuracy. However, a desirable batch size might exceed the available memory that a single GPU can offer, therefore motivating the use of data-parallelism across multiple GPU devices. In order to facilitate this, we utilize synchronous data-parallelism across multiple GPUs, with each GPU storing an identical copy of all trainable parameters. In this paradigm, a batch of training data is split into sub-batches, one for each device. Specifically, batches of spatial and temporal points used to evaluate the training loss are generated randomly and independently on each available GPU, and gradients of the training loss are then aggregated across all devices with a collective reduce-mean operation. As such, each device can then update its own local copy of all trainable model parameters at each gradient descent iteration using the global gradient signal that is broadcast across all devices. In our implementation, this is efficiently performed leveraging the jax.pmap primitive in JAX [35], allowing us to seamlessly scale our code to an arbitrary number of GPUs. The parallel performance of our implementation will be assessed via strong and weak scaling studies, as discussed in section 5.3.
+
+## 5 Results 
+
+Our goal in this section is to demonstrate the effectiveness of the proposed causal training algorithm by providing state-of-the-art numerical results for various types of differential equations exhibiting chaotic behavior, where existing PINNs formulations are destined for failure. Specifically, we will consider the forward simulation of the chaotic Lorenz system, the Kuramoto–Sivashinsky equation, and a two-dimensional simulation of decaying turbulence governed by the incompressible Navier-Stokes equations. Although these benchmarks can all be easily tackled using conventional numerical methods, they have remained elusive to PINNs since their initial conception [48, 28], and all the variants that followed the reincarnation of this framework by Raissi et al. [29].
+
+Throughout all benchmarks, we will employ the modified MLP architecture discussed in section 4 equipped with hyperbolic tangent activation functions (Tanh) and initialized using the Glorot normal scheme [49], unless otherwise stated. We will enforce periodic boundary conditions as hard constraints by constructing appropriate Fourier features embedding of the input, as discussed in section 4. All networks are trained via stochastic gradient descent using the Adam optimizer with default settings [32] and an exponential learning rate decay with a decay-rate of 0.9 every 5,000 training iterations. As suggested by [16, 24, 25], we will also employ time-marching to reduce optimization difficulties. Specifically, we will split up the temporal domain of interest $[0, T]$ into sub-domains $[0, \Delta t], [\Delta t, 2\Delta t], \dots, [T - \Delta t, T]$, and train networks to learn the solution in each sub-domain, where the initial condition is obtained from the prediction of the previously trained network. At the end of training, the resulting PINN model can produce predictions for the target solution at any continuous query location in the global spatio-temporal domain. All hyper-parameter settings, computational costs, implementation details and validation metrics are discussed in the Appendix. The code and data accompanying this manuscript will be made publicly available at https://github.com/PredictiveIntelligenceLab/CausalPINNs.
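+
+The time-marching procedure described above can be sketched as a loop that chains initial conditions between windows. This is a schematic sketch only: `time_windows`, `time_march`, and the `train_window` callback are hypothetical names, and the stand-in `exact_window` returns the closed-form solution of $du/dt = -u$ in place of a trained network.
+
+```python
+import jax.numpy as jnp
+
+
+def time_windows(T, dt):
+    """Split [0, T] into consecutive windows [0, dt], [dt, 2dt], ..., [T - dt, T]."""
+    n = int(round(T / dt))
+    edges = jnp.linspace(0.0, T, n + 1)
+    return [(float(edges[i]), float(edges[i + 1])) for i in range(n)]
+
+
+def time_march(train_window, u0, T, dt):
+    """Train one model per window, feeding each prediction at t_end forward."""
+    models, ic = [], u0
+    for t_start, t_end in time_windows(T, dt):
+        predict = train_window(ic, t_start, t_end)  # callable: t -> state
+        models.append((t_start, t_end, predict))
+        ic = predict(t_end)                         # next window's initial condition
+    return models
+
+
+def exact_window(ic, t_start, t_end):
+    """Stand-in for PINN training: exact solution of du/dt = -u on the window."""
+    return lambda t: ic * jnp.exp(-(t - t_start))
+
+
+models = time_march(exact_window, jnp.asarray(1.0), T=1.0, dt=0.25)
+u_final = models[-1][2](1.0)   # prediction at the end of the last window
+```
+
+In an actual run, `train_window` would train a network with Algorithm 1 on the window and return its prediction function.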
+
+5.1 Lorenz system
+
+As our first example, we consider the chaotic Lorenz system. It is well known that this system exhibits strong sensitivity to its initial conditions, which can trigger divergent trajectories in finite time if the numerical predictions sought are not sufficiently accurate. The system is described by the following ordinary differential equations
+
+$$\frac{dx}{dt} = \sigma(y - x), \tag{5.1}$$
+
+$$\frac{dy}{dt} = x(\rho - z) - y, \tag{5.2}$$
+
+$$\frac{dz}{dt} = xy - \beta z. \tag{5.3}$$
+
+These equations arise in studies of convection and instability in planetary atmospheric convection, where $x$, $y$, and $z$ denote variables proportional to convective intensity, horizontal, and vertical temperature differences [50]. Parameters $\sigma$, $\rho$ and $\beta$ denote the Prandtl number, Rayleigh number, and a geometric factor, respectively. The Lorenz system is well-known to be chaotic for certain parameter values and initial conditions. Here, we consider a classical setting with $\sigma = 3$, $\rho = 28$, and $\beta = 8/3$. Our goal is to construct a PINNs model for learning the ODE solution up to time $T = 20$, starting from an initial condition $[x(0), y(0), z(0)] = [1, 1, 1]$ that does not lie on the system's attractor. The employed PINN model architecture and training hyper-parameters are discussed in Appendix B.
+
+Figure 5 shows the predicted trajectory against the reference trajectory obtained via a classical numerical solver (see Appendix B for more details), where an excellent agreement can be observed with relative $L^2$ errors of 1.139e−02, 1.656e−02, and 7.038e−03 for the $x$, $y$, $z$ components, respectively. Moreover, all training losses are plotted in Appendix Figure 17. We can see that the stopping criterion $\min_i w_i > \delta$ discussed in section 3 is satisfied for the training of each time window. It is worth pointing out that the proposed stopping criterion not only benefits the predictive accuracy, but also substantially reduces the computational cost. To verify this, we train the network by removing the stopping criterion and training for a fixed number of iterations for each time window under exactly the same hyper-parameter setting. Interestingly, as shown in Appendix Figure 19, the training losses can achieve slightly lower values than the ones using the stopping criterion. However, the model predictions are less accurate, as some discrepancies can be clearly observed in Appendix Figure 18. Although the reason behind this behavior still remains unclear, it appears that training the model for more iterations after the proposed stopping criterion has been met gives rise to over-fitting.
+
+5.2 Kuramoto–Sivashinsky equation 
+
+The next example aims to illustrate the effectiveness of our method in tackling spatio-temporal chaotic systems. To this end, we consider the one-dimensional Kuramoto–Sivashinsky equation, which has been independently derived in the context of reaction-diffusion systems [51] and flame front propagation [52]. The Kuramoto–Sivashinsky equation exhibits a wealth of spatially and temporally nontrivial dynamical behavior including chaos, and has served as a model example in efforts to understand and predict the complex dynamical behavior associated with a variety of physical systems. The equation takes the form
+
+$$u_t + \alpha u u_x + \beta u_{xx} + \gamma u_{xxxx} = 0, \tag{5.4}$$
+
+subject to periodic boundary conditions and an initial condition
+
+$$u(0, x) = u_0(x). \tag{5.5}$$
+
+Figure 5: Lorenz system: Comparison between the predicted and reference solutions. Panels show the predicted $x(t)$, $y(t)$, $z(t)$ and the corresponding absolute errors for $t \in [0, 20]$.
+
+Case I (regular): We start with a relatively simple scenario by setting $\alpha = 5$, $\beta = 0.5$, $\gamma = 0.005$, and a spatial domain $[-1, 1]$. The initial condition is given by $u_0(x) = -\sin(\pi x)$. Our goal is to learn the associated solution up to time $T = 1$. A detailed visual assessment of the predicted solution is presented in Figure 6. In particular, we present a comparison between the reference and the predicted solutions at different time instants $t = 0, 0.5, 1.0$. It can be observed that the PINNs prediction achieves an excellent agreement with the reference solutions, yielding an error of 3.49e−04 measured in the relative $L^2$ norm. This is further illustrated by the temporal relative $L^2$ error shown in the left panel of Figure 8. In particular, one may note that the error increases by one order of magnitude for $t \in [0.4, 0.6]$, where the solution happens to experience a fast transition. This behavior is consistent with the larger loss values and the larger number of training iterations required before the stopping criterion is met, as observed in Appendix Figure 20.
+
+To highlight the computational efficiency of Taylor-mode automatic differentiation (Taylor-mode AD) discussed in section 4, here we provide a comparison in terms of computational cost against conventional reverse-mode automatic differentiation (AD) [31]. Specifically, we consider PINN models with a different number of layers and batch sizes. As shown in Figure 7, Taylor-mode AD provides a significant advantage in terms of computational efficiency, allowing us to accommodate larger architectures and batch sizes. As a consequence, for the same architecture and batch size, we have consistently observed a speed-up of 3-5x in the total training time required for Taylor-mode AD versus conventional AD.
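+
+For reference, the residual of Equation 5.4 can be evaluated with conventional nested `jax.grad` calls, which is the baseline AD approach rather than the Taylor-mode AD used in the paper. The sketch below is illustrative only: `ks_residual` is a hypothetical name, and the closed-form `u_test` stands in for a trained network so that the result can be checked by hand.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def ks_residual(u, t, x, alpha, beta, gamma):
+    """Residual u_t + alpha*u*u_x + beta*u_xx + gamma*u_xxxx at one point (t, x)."""
+    u_t = jax.grad(u, argnums=0)(t, x)
+    # Nested reverse-mode derivatives in x, up to fourth order.
+    u_x = jax.grad(u, argnums=1)
+    u_xx = jax.grad(u_x, argnums=1)
+    u_xxx = jax.grad(u_xx, argnums=1)
+    u_xxxx = jax.grad(u_xxx, argnums=1)
+    return (u_t + alpha * u(t, x) * u_x(t, x)
+            + beta * u_xx(t, x) + gamma * u_xxxx(t, x))
+
+
+def u_test(t, x):
+    """Closed-form stand-in for a network output (not a solution of the PDE)."""
+    return jnp.exp(-t) * jnp.sin(x)
+
+
+xs = jnp.linspace(0.0, 2.0 * jnp.pi, 8)
+# Case I coefficients, evaluated at t = 0 over a batch of x values.
+residuals = jax.vmap(lambda x: ks_residual(u_test, 0.0, x, 5.0, 0.5, 0.005))(xs)
+```
+
+Each additional derivative order wraps another `jax.grad` around the previous function, which is the nesting whose cost Taylor-mode AD is meant to avoid.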
+
+Case II (chaotic): We proceed by solving a more challenging case exhibiting chaotic behavior, which remains stubbornly unsolved using existing PINNs formulations [ 53 ]. Specifically, we set α = 100 /16 , β = 100 /16 2, γ =100 /16 4, for a fixed spatial domain in [0 , 2π]. Starting from an initial condition in the chaotic regime, we use PINNs to solve Kuramoto–Sivashinsky equation up to time T = 0 .5. The results are summarized in Figure 9, from which one can see that the predicted solution is in good agreement with the reference solution obtained via classical spectral methods (see Appendix F for more details). The resulting relative L2 error over the entire spatio-temporal domain is 
+
+2.46 e − 02 , which is visualized in the right panel of Figure 8. These results highly suggest that the proposed causal training algorithm enables PINN models to capture the intricate chaotic behavior of this system. From a critical standpoint, here we should also mention that difficulties can still arise in simulating the long-time behavior of chaotic systems. Figure 10 summarizes our results starting with a simple initial state u0(x) = cos( x)(1 + sin( x)) ,and simulating the dynamics up to time T = 0 .9. One can observe that the predicted solution accurately captures the transition to chaos at around t = 0 .4, while eventually loses accuracy after t = 0 .8 due to the chaotic nature of the problem and the inevitable numerical error accumulation of PINNs, leading to a relative L2 error above 10% for the final state. This highlights the crucial need for further enhancing the accuracy of PINN approximations in order to retain effectiveness in such complex regimes. Long-time integration in general, has been one PINNs’ major drawbacks, and in future work we plan to address this via operator learning techniques as described in [39]. 12 A PREPRINT - M ARCH 16, 2022 −1 0 1       
+
+Figure 6: Kuramoto–Sivashinsky equation (regular): Top: Reference solution versus the prediction of a trained physics-informed neural network using Algorithm 1. The resulting relative L2 error is 3.49e−04. Bottom: Comparison of the predicted and reference solutions at the three temporal snapshots t = 0, 0.5, 1.0.
+
+Figure 7: Kuramoto–Sivashinsky equation (regular): Left: Time (ms) to evaluate the loss function of a PINN model for different numbers of layers, for conventional AD and Taylor-mode AD. The remaining hyper-parameters are the same as in Table 3. Right: Time (ms) to evaluate the forward pass of a PINN model for different batch sizes. The remaining hyper-parameters are the same as in Table 3.
+
+Figure 8: Kuramoto–Sivashinsky equation: Left: Relative L2 errors of Case I (regular). Right: Relative L2 errors of Case II (chaotic).
+
+Figure 9: Kuramoto–Sivashinsky equation (chaotic): Top: Reference solution versus the prediction of a trained physics-informed neural network using Algorithm 1. The resulting relative L2 error is 2.26e−02. Bottom: Comparison of the predicted and reference solutions at the three temporal snapshots t = 0, 0.25, 0.5.
+
+Figure 10: Kuramoto–Sivashinsky equation (chaotic): Reference solution versus the prediction of a trained physics-informed neural network using Algorithm 1. The initial condition is u0(x) = cos(x)(1 + sin(x)). An animation of the solution evolution is provided at <https://github.com/PredictiveIntelligenceLab/CausalPINNs#kuramotosivashinsky-equation>.
+
+5.3 Navier-Stokes equation 
+
+To further emphasize the effectiveness of the proposed causal training algorithm for solving chaotic dynamical systems, in the last example, we consider a classical two-dimensional decaying turbulence example in a square domain with periodic boundary conditions. This problem can be modeled via the incompressible Navier-Stokes equations expressed in the velocity-vorticity formulation 
+
+$$
+w_t + \mathbf{u} \cdot \nabla w = \frac{1}{\mathrm{Re}} \Delta w, \quad \text{in } [0, T] \times \Omega, \tag{5.6}
+$$
+
+$$
+\nabla \cdot \mathbf{u} = 0, \quad \text{in } [0, T] \times \Omega, \tag{5.7}
+$$
+
+$$
+w(0, x, y) = w_0(x, y), \quad \text{in } \Omega, \tag{5.8}
+$$
+
+where u = (u, v) denotes the flow velocity field, w = ∇ × u denotes the vorticity, and Re is the Reynolds number. In addition, we set Ω = [0, 2π]² and Re = 100. Our goal is to use PINNs to simulate the flow up to T = 1.
+
+Figure 11 presents the predicted velocity and vorticity fields at T = 1. More detailed visual assessments are provided in Appendix G. We can see that all latent variables of interest are in good agreement with their corresponding reference solutions, yielding errors of 3.90e−02, 2.61e−02, and 3.53e−02 for u, v, w, respectively, over the entire spatio-temporal domain. This observation is further illustrated by the resulting errors reported in Figure 12 and the computed energy spectrum in Figure 13. These results highlight the remarkable effectiveness of the proposed causal training algorithm, successfully enabling the PINNs model to capture such complicated turbulent flow without any training data.
+
+Figure 11: Navier-Stokes equation: Representative snapshot of the predicted velocity and vorticity versus the corresponding reference solution at t = 1. An animation of the solution evolution is provided at <https://github.com/PredictiveIntelligenceLab/CausalPINNs#navier-stokes-equation>.
+
+Figure 12: Navier-Stokes equation: Relative L2 prediction errors for u, v, w, respectively.
+
+For this benchmark, we also report the performance of our parallel JAX implementation on a compute node equipped with 8 NVIDIA Ampere A6000 GPUs. We use an effective batch size of 42,000 spatio-temporal points sampled in each training iteration on each GPU, with a network consisting of 6 layers with 300 neurons per layer. Figure 14 presents the scaling results obtained. To conduct a strong scaling study, we keep the problem size fixed and split the batch across several GPUs. As expected, we notice a speed-up, but the benefits deteriorate as the number of GPUs is increased beyond 4. We attribute this behavior to the fact that, for a fixed problem size, the compute load assigned to each GPU decreases as the number of devices is increased, leading to an under-utilization of each device. We have also performed a weak scaling study in which the number of points sampled per GPU is fixed. Under this setting, we report excellent parallel efficiency that remains above 99% as the number of GPUs is increased. While we have only considered data-parallelism in this study, we may be able to obtain further speed-ups by considering a combination of data- and function-parallelism techniques [54] in future studies.
+
+Figure 14 also reports the effect of the training batch size on the resulting L2 accuracy for the first time window (t ∈ [0, 0.1]). In general, we notice that an increase in batch size results in higher accuracy of the network. This motivates the use of larger batch sizes through data-parallelism as a mechanism for enhancing the accuracy of PINNs in more challenging problems.
+
+Figure 13: Navier-Stokes equation: Reference versus predicted normalized kinetic energy spectra at different time snapshots t = 0.0, 0.5, 1.0.
+
+Figure 14: Parallel Performance: Left: Strong Scaling: Keeping the total batch size for the problem fixed, we evaluate the speedup obtained when the batch is split across multiple devices. Centre: Weak Scaling: Keeping the batch size on each GPU fixed, we report the scaling efficiency, computed by dividing the time taken on a single device by the time taken on n devices. Right: Effect of batch size: L2 error for models trained until t = 0.1 using Nt and Nx points per iteration in the temporal and spatial domains, respectively.
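
The two scaling metrics reported in Figure 14 reduce to simple ratios of wall-clock times. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the strong-scaling efficiency uses the standard textbook definition (it is not stated in the text), and the timings are illustrative numbers, not measurements from this work.

```python
def speedup(t_single, t_multi):
    """Strong-scaling speedup: single-device time over n-device time, same total batch."""
    return t_single / t_multi


def strong_efficiency(t_single, t_multi, n_devices):
    """Fraction of the ideal n-fold speedup that is achieved (standard definition)."""
    return speedup(t_single, t_multi) / n_devices


def weak_efficiency_percent(t_single, t_multi):
    """Weak-scaling efficiency: the per-device batch is fixed, so the ideal time is unchanged."""
    return 100.0 * t_single / t_multi


# Hypothetical per-iteration timings in milliseconds (not taken from this work).
t1 = 8.0          # one device, full batch
t4_strong = 2.5   # four devices, full batch split four ways
t4_weak = 8.05    # four devices, each processing the single-device batch

s = speedup(t1, t4_strong)
strong_eff = strong_efficiency(t1, t4_strong, 4)
weak_eff = weak_efficiency_percent(t1, t4_weak)
```

A strong-scaling efficiency well below 1 at higher device counts corresponds to the under-utilization described above, while a weak-scaling value close to 100% means the added devices process extra points at almost no extra cost per iteration.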
+
+## 6 Discussion 
+
+Physical systems possess an inherent causal structure that explains the fundamental relationship between causes and effects governing their dynamic evolution. In this work, we show that physics-informed neural networks are prone to violating that structure when trained to infer the solution of time-dependent PDEs. Specifically, by studying the limiting neural tangent kernel of PINNs we reveal an implicit bias indicating a preference of PINNs to first minimize PDE residuals at later times, before even fitting the initial data. We argue that this fundamental drawback is one of the key reasons why PINNs can fail in practice. To resolve this shortcoming, we propose a novel causal training algorithm that can restore physical causality during the training of a PINNs model by appropriately re-weighting the PDE residual loss at each iteration of gradient descent. Interestingly, this also leads to a simple stopping criterion for effectively assessing the convergence of the total training loss. We demonstrate that this simple modification alone is sufficient to achieve 10-100x improvements in accuracy compared to competing approaches, opening the path to tackling challenging problems that were not accessible to PINNs before, such as the chaotic Lorenz and Kuramoto–Sivashinsky equations, and the incompressible Navier-Stokes equations in the turbulent regime.
+
+In this work we have solely focused on forward simulation problems, as we believe that these are the cases that most strongly expose the challenges and limitations in building PINNs models. While it is true that PINNs are currently better suited and have enjoyed far more success in tackling hybrid/inverse problems in which observational data is available, we believe that respecting causality is a crucial factor to consider when training a PINNs model, regardless of the forward/inverse nature of a given problem. To this end, in the inverse problem setting one should consider observational data as point sources of information, and ensure that PDE residuals are first adequately minimized at those locations before propagating information outwards. A more detailed exploration of this direction will be sought in future work.
+
+We must also note that different problems are likely to pose a different causal structure. For example, in optimal control one needs to predict the state of a system by evolving its dynamics forward in time from a given initial condition, but also compute sensitivities with respect to a control input by evolving the adjoint system backwards in time from a given terminal condition that depends on the final system state. In this case, what we here refer to as "temporal causality" takes a different form for the state (forward) and the co-state (adjoint) simulations. However, our main message remains the same: respecting causality matters, and training algorithms for PINNs should be designed to respect how information propagates according to the underlying principles that govern the evolution of a given system.
+
+Given the rising prominence of PINNs across academic and industrial use cases, we consider this as a hallmark contribution that sets a new standard for what such models are capable of. We anticipate that the findings of this work will create new opportunities for the application of PINNs to more complex scenarios across diverse domains, including fluid mechanics, electromagnetics, quantum mechanics, and elasticity.
+
+However, despite the encouraging results reported here, there is still a gap between the current progress in PINNs research and real-world applications. We have to admit that using PINNs as a forward PDE solver is significantly more time-consuming than using traditional numerical solvers. Therefore, future research should focus on accelerating the training of PINNs. Distributed and parallel implementations can be of great help [55, 21] in this direction. Another aspect with great room for improvement is related to architecture design. Even though effective modifications such as the modified MLP discussed in Section 4 and in [12] can introduce noticeable gains in accuracy, a niche architecture similar to what convolutional networks have been for vision or Transformers for language processing is yet to be discovered for solving PDEs. To this end, we must recognize that training a PINN model is fundamentally different from solving conventional supervised learning tasks, requiring us to design more effective architectures for minimizing PDE residuals in a self-supervised manner. We believe that addressing these open questions will become an important piece of the puzzle in advancing the use of physics-informed machine learning as a reliable analysis tool in computational science and engineering.
+
+## Acknowledgements 
+
+We would like to acknowledge support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116), the US Air Force (grant AFOSR FA9550-20-1-0060), and US Department of Energy/Advanced Research Projects Agency (grant DE-AR0001201). We also thank the developers of the software that enabled our research, including JAX [35], JAX-CFD [56], Matplotlib [57], and NumPy [58].
+
+## References 
+
+[1] Maziar Raissi, Alireza Yazdani, and George Em Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.
+
+[2] Abhilash Mathews, Manaure Francisquez, Jerry W Hughes, David R Hatch, Ben Zhu, and Barrett N Rogers. Uncovering turbulent plasma dynamics via deep learning from partial observations. Physical Review E, 104(2):025205, 2021.
+
+[3] Georgios Kissas, Yibo Yang, Eileen Hwuang, Walter R Witschey, John A Detre, and Paris Perdikaris. Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 358:112623, 2020.
+
+[4] Alireza Yazdani, Lu Lu, Maziar Raissi, and George Em Karniadakis. Systems biology informed deep learning for inferring parameters and hidden dynamics. PLoS computational biology, 16(11):e1007575, 2020.
+
+[5] Sifan Wang and Paris Perdikaris. Deep learning of free boundary and Stefan problems. Journal of Computational Physics, 428:109914, 2021.
+
+[6] Khemraj Shukla, Patricio Clark Di Leoni, James Blackshire, Daniel Sparkman, and George Em Karniadakis. Physics-informed neural network for ultrasound nondestructive quantification of surface breaking cracks. Journal of Nondestructive Evaluation, 39(3):1–20, 2020.
+
+[7] Yuyao Chen, Lu Lu, George Em Karniadakis, and Luca Dal Negro. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Optics express, 28(8):11618–11633, 2020.
+
+[8] Francisco Sahli Costabal, Yibo Yang, Paris Perdikaris, Daniel E Hurtado, and Ellen Kuhl. Physics-informed neural networks for cardiac activation mapping. Frontiers in Physics, 8:42, 2020.
+
+[9] Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 384:113938, 2021.
+
+[10] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, pages 1–19, 2021.
+
+[11] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
+
+[12] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5):A3055–A3081, 2021.
+
+[13] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.
+
+[14] Levi McClenny and Ulisses Braga-Neto. Self-adaptive physics-informed neural networks using a soft attention mechanism. arXiv preprint arXiv:2009.04544, 2020.
+
+[15] Suryanarayana Maddu, Dominik Sturm, Christian L Müller, and Ivo F Sbalzarini. Inverse dirichlet weighting enables reliable training of physics informed neural networks. Machine Learning: Science and Technology, 2021.
+
+[16] Colby L Wight and Jia Zhao. Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics informed neural networks. arXiv preprint arXiv:2007.04542, 2020.
+
+[17] Mohammad Amin Nabian, Rini Jasmine Gladstone, and Hadi Meidani. Efficient training of physics-informed neural networks via importance sampling. Computer-Aided Civil and Infrastructure Engineering, 2021.
+
+[18] Jie Bu and Anuj Karpatne. Quadratic residual networks: A new class of neural networks for solving forward and inverse problems in physics involving PDEs. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 675–683. SIAM, 2021.
+
+[19] Ameya D Jagtap, Yeonjong Shin, Kenji Kawaguchi, and George Em Karniadakis. Deep kronecker neural networks: A general framework for neural networks with adaptive activation functions. Neurocomputing, 468:165–180, 2022.
+
+[20] Senwei Liang, Liyao Lyu, Chunmei Wang, and Haizhao Yang. Reproducing activation function for deep learning. arXiv preprint arXiv:2101.04844, 2021.
+
+[21] Ameya D Jagtap and George Em Karniadakis. Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Communications in Computational Physics, 28(5):2002–2041, 2020.
+
+[22] Ben Moseley, Andrew Markham, and Tarje Nissen-Meyer. Finite basis physics-informed neural networks (fbpinns): a scalable domain decomposition approach for solving differential equations. arXiv preprint arXiv:2107.07871, 2021.
+
+[23] Ameya D Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020.
+
+[24] Aditi S Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. arXiv preprint arXiv:2109.01050, 2021.
+
+[25] Revanth Mattey and Susanta Ghosh. A novel sequential method to train physics informed neural networks for allen cahn and cahn hilliard equations. Computer Methods in Applied Mechanics and Engineering, 390:114474, 2022.
+
+[26] Walter A Strauss. Partial differential equations: An introduction. John Wiley & Sons, 2007.
+
+[27] L.C. Evans and American Mathematical Society. Partial Differential Equations. Graduate studies in mathematics. American Mathematical Society, 1998.
+
+[28] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
+
+[29] Maziar Raissi, Hessam Babaee, and Peyman Givi. Deep learning of turbulent scalar mixing. Physical Review Fluids, 4(12):124501, 2019.
+
+[30] Ehsan Kharazmi, Zhongqiang Zhang, and George Em Karniadakis. Variational physics-informed neural networks for solving partial differential equations. arXiv preprint arXiv:1912.00873, 2019.
+
+[31] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
+
+[32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
+
+[33] Arieh Iserles. A first course in the numerical analysis of differential equations. Number 44. Cambridge university press, 2009.
+
+[34] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
+
+[35] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
+
+[36] Lu Lu, Xuhui Meng, Zhiping Mao, and George E Karniadakis. DeepXDE: A deep learning library for solving differential equations. arXiv preprint arXiv:1907.04502, 2019.
+
+[37] Oliver Hennigh, Susheela Narasimhan, Mohammad Amin Nabian, Akshay Subramaniam, Kaustubh Tangsali, Zhiwei Fang, Max Rietmann, Wonmin Byeon, and Sanjay Choudhry. Nvidia simnet™: An ai-accelerated multi-physics simulation framework. In International Conference on Computational Science, pages 447–461. Springer, 2021.
+
+[38] Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepOnets. arXiv preprint arXiv:2103.10974, 2021.
+
+[39] Sifan Wang and Paris Perdikaris. Long-time integration of parametric evolution equations with physics-informed deeponets. arXiv preprint arXiv:2106.05384, 2021.
+
+[40] Sifan Wang, Hanwen Wang, and Paris Perdikaris. Improved architectures and training algorithms for deep operator networks. arXiv preprint arXiv:2110.01654, 2021.
+
+[41] Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial differential equations. arXiv preprint arXiv:2111.03794, 2021.
+
+[42] Yifan Du and Tamer A Zaki. Evolutional deep neural network. arXiv preprint arXiv:2103.09959, 2021.
+
+[43] Shashank Reddy Vadyala, Sai Nethra Betgeri, and Naga Parameshwari Betgeri. Physics-informed neural network method for solving one-dimensional advection equation using pytorch. Array, 13:100110, 2022.
+
+[44] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. Journal of Computational Physics, 435:110242, 2021.
+
+[45] N Sukumar and Ankit Srivastava. Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks. arXiv preprint arXiv:2104.08426, 2021.
+
+[46] Lu Lu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, and Steven G Johnson. Physics-informed neural networks with hard constraints for inverse design. arXiv preprint arXiv:2102.04626, 2021.
+
+[47] Jesse Bettencourt, Matthew J Johnson, and David Duvenaud. Taylor-mode automatic differentiation for higher-order derivatives in jax. 2019.
+
+[48] Dimitris C Psichogios and Lyle H Ungar. A hybrid neural network-first principles approach to process modeling. AIChE Journal, 38(10):1499–1511, 1992.
+
+[49] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
+
+[50] Edward N Lorenz. Deterministic nonperiodic flow. Journal of atmospheric sciences, 20(2):130–141, 1963.
+
+[51] Yoshiki Kuramoto and Toshio Tsuzuki. Persistent propagation of concentration waves in dissipative media far from thermal equilibrium. Progress of theoretical physics, 55(2):356–369, 1976.
+
+[52] Gregory I Sivashinsky. Nonlinear analysis of hydrodynamic instability in laminar flames—i. derivation of basic equations. Acta astronautica, 4(11):1177–1206, 1977.
+
+[53] Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. The Journal of Machine Learning Research, 19(1):932–955, 2018.
+
+[54] Michael Schaarschmidt, Dominik Grewe, Dimitrios Vytiniotis, Adam Paszke, Georg Stefan Schmid, Tamara Norman, James Molloy, Jonathan Godwin, Norman Alexander Rink, Vinod Nair, et al. Automap: Towards ergonomic automated parallelism for ml models. arXiv preprint arXiv:2112.02958, 2021.
+
+[55] Khemraj Shukla, Ameya D Jagtap, and George Em Karniadakis. Parallel physics-informed neural networks via domain decomposition. arXiv preprint arXiv:2104.10013, 2021.
+
+[56] Dmitrii Kochkov, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21), 2021.
+
+[57] John D Hunter. Matplotlib: A 2D graphics environment. IEEE Annals of the History of Computing, 9(03):90–95, 2007.
+
+[58] Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357–362, 2020.
+
+[59] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
+
+[60] Tobin A Driscoll, Nicholas Hale, and Lloyd N Trefethen. Chebfun guide, 2014.
+
+[61] Steven M Cox and Paul C Matthews. Exponential time differencing for stiff systems. Journal of Computational Physics, 176(2):430–455, 2002.
+
+## A Nomenclature 
+
+Table 2 summarizes the main symbols and notations used in this work.
+
+| Notation | Description |
+| --- | --- |
+| PDE | Partial differential equation |
+| PINN | Physics-informed neural network |
+| NTK | Neural Tangent Kernel |
+| u(·) | solution of a PDE |
+| N[·] | a linear or non-linear differential operator |
+| B[·] | a boundary operator |
+| uθ(·) | neural network representation of the latent PDE solution |
+| θ | all trainable parameters of a neural network |
+| Nt | number of temporal collocation points |
+| Nx | number of spatial collocation points |
+| wi | residual weights at time ti |
+| ε | causality parameter |
+| δ | stopping criterion threshold for terminating a training loop |
+| Lr(t, θ) | temporal residual loss |
+| L(θ) | aggregate training loss |
+
+Table 2: Nomenclature: Summary of the main symbols and notations used in this work.
+
+## B Hyper-Parameters 
+
+Table 3 summarizes the network hyper-parameters for all numerical experiments. We tuned these hyper-parameters manually, without attempting to find the absolute best hyper-parameter setting. This process can be automated in the future by leveraging effective techniques for meta-learning and hyper-parameter optimization [59].
+
+| Case | Architecture | Depth | Width | Nt | Nx |
+| --- | --- | --- | --- | --- | --- |
+| Allen-Cahn | MLP | 6 | 128 | 100 | 256 |
+| Allen-Cahn | Modified MLP | 6 | 128 | 100 | 256 |
+| Lorenz | MLP | 5 | 512 | 256 | - |
+| Kuramoto–Sivashinsky (regular) | Modified MLP | 5 | 256 | 32 | 64 |
+| Kuramoto–Sivashinsky (chaotic) | Modified MLP | 10 | 128 | 32 | 256 |
+| Navier-Stokes | Modified MLP | 6 | 128 | 64 | 512 |
+
+Table 3: Network architectures for each benchmark employed in this work.
+
+## C Computational Cost 
+
+Training: Table 4 summarizes the computational cost of training PINNs. The model sizes and network architectures are listed in Table 3. All networks are trained using NVIDIA RTX A6000 graphics cards.
+
+| Case | Architecture | # Time windows | Max. Iterations | Training time (iter/sec) |
+| --- | --- | --- | --- | --- |
+| Allen-Cahn | MLP | 1 | 3 × 10⁵ | 120.30 |
+| Allen-Cahn | Modified MLP | 1 | 3 × 10⁵ | 58.42 |
+| Lorenz | MLP | 40 | 1 × 10⁵ | 957.41 |
+| Kuramoto–Sivashinsky (regular) | Modified MLP | 10 | 2 × 10⁵ | 164.77 |
+| Kuramoto–Sivashinsky (chaotic) | Modified MLP | 5 | 2 × 10⁵ | 28.22 |
+| Navier-Stokes | Modified MLP | 10 | 1 × 10⁵ | 68.29 |
+
+Table 4: Computational cost. Reported timings are obtained on NVIDIA RTX A6000 graphics cards.
+
+We remark that "Max. Iterations" is the maximum number of iterations for every tolerance ε in each time window. The default tolerance list is [10⁻², 10⁻¹, 10⁰, 10¹, 10²] unless otherwise stated. The total number of iterations may vary for different examples due to the stopping criterion (see Algorithm 1).
+
+## D Allen-Cahn equation 
+
+Validation: We solve the Allen-Cahn equation using conventional spectral methods. Specifically, assuming periodic boundary conditions, we start from the initial condition u0(x) = x² cos(πx) and integrate the system up to the final time T = 1. Synthetic validation data are generated using the Chebfun package [60] with a spectral Fourier discretization with 512 modes and a fourth-order stiff time-stepping scheme (ETDRK4) [61] with time-step size 10⁻⁵.
+
+Figure 15: Allen-Cahn equation: Top: Exact solution versus the prediction of a trained physics-informed neural network using Algorithm 1 and modified MLP. The resulting relative L2 error is 2.46e−04. Bottom: Comparison of the predicted and exact solutions at the three temporal snapshots t = 0.0, 0.5, 1.0.
+
+Figure 16: Allen-Cahn equation: Left: Loss convergence (Lic and Lr) of training a physics-informed neural network using Algorithm 1. Middle: Temporal residual loss L(t, θ) at different training iterations. Right: Temporal weights w at different training iterations.
+
+## E Lorenz system
+
+Validation: The reference solution is obtained using scipy.integrate.odeint with default settings. 
+
+PINNs implementation: We split the whole domain [0, 20] into 40 disjoint time windows of size ∆t = 0.5. For each time window, we represent the latent variables of interest by a 5-layer fully-connected neural network uθ with 512 neurons per hidden layer,
+
+$$
+t \xrightarrow{u_\theta} [x_\theta, y_\theta, z_\theta]. \tag{E.1}
+$$
+
+Since the Lorenz system is highly sensitive to the initial condition, we impose the initial condition exactly by
+
+$$
+\hat{x}_\theta(t) = x_\theta(t) \cdot t + x(0), \tag{E.2}
+$$
+
+$$
+\hat{y}_\theta(t) = y_\theta(t) \cdot t + y(0), \tag{E.3}
+$$
+
+$$
+\hat{z}_\theta(t) = z_\theta(t) \cdot t + z(0). \tag{E.4}
+$$
+
+Then the loss function reduces to the residual loss
+
+$$
+\begin{aligned}
+\mathcal{L}_r(\theta) ={}& \frac{1}{N_t} \sum_{i=1}^{N_t} w_i \left| \frac{d\hat{x}_\theta}{dt}(t_i) - \sigma \left( \hat{y}_\theta(t_i) - \hat{x}_\theta(t_i) \right) \right| && \text{(E.5)} \\
+&+ \frac{1}{N_t} \sum_{i=1}^{N_t} w_i \left| \frac{d\hat{y}_\theta}{dt}(t_i) - \hat{x}_\theta(t_i) \left( \rho - \hat{z}_\theta(t_i) \right) + \hat{y}_\theta(t_i) \right| && \text{(E.6)} \\
+&+ \frac{1}{N_t} \sum_{i=1}^{N_t} w_i \left| \frac{d\hat{z}_\theta}{dt}(t_i) - \hat{x}_\theta(t_i) \hat{y}_\theta(t_i) + \beta \hat{z}_\theta(t_i) \right|, && \text{(E.7)}
+\end{aligned}
+$$
+
+where $\{t_i\}_{i=1}^{N_t}$ is a uniform grid in [0, ∆t]. For this example, we set Nt = 256 and train the network with full-batch gradient descent. The temporal weights are updated by the proposed algorithm.
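
The hard initial-condition constraint (E.2)-(E.4) and the weighted residual loss (E.5)-(E.7) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' JAX implementation: `toy_net` is a hypothetical stand-in for the 5-layer MLP, central finite differences replace automatic differentiation, the Lorenz parameters and the initial condition are assumed values, and the residuals follow the standard Lorenz equations dx/dt = σ(y − x), dy/dt = x(ρ − z) − y, dz/dt = xy − βz. The weights use the exponential form w_i = exp(−ε Σ_{k<i} L_r(t_k)) of the causal training algorithm, with the loss at each earlier time taken here as the sum of the three absolute residuals.

```python
import math

# Assumed Lorenz parameters for illustration; this appendix does not restate them.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0


def hard_ic(net, t, u0):
    """Return u_hat(t) = net(t) * t + u0, which equals u0 at t = 0 for any net."""
    return [n * t + c for n, c in zip(net(t), u0)]


def lorenz_residuals(net, t, u0, h=1e-5):
    """Absolute residuals of the three Lorenz equations at time t.

    Central finite differences stand in for automatic differentiation here.
    """
    x, y, z = hard_ic(net, t, u0)
    plus, minus = hard_ic(net, t + h, u0), hard_ic(net, t - h, u0)
    dx, dy, dz = [(p - m) / (2.0 * h) for p, m in zip(plus, minus)]
    return [
        abs(dx - SIGMA * (y - x)),
        abs(dy - x * (RHO - z) + y),
        abs(dz - x * y + BETA * z),
    ]


def causal_weights(losses, eps):
    """w_i = exp(-eps * sum of losses at earlier times); the first weight is 1."""
    weights, running = [], 0.0
    for loss_i in losses:
        weights.append(math.exp(-eps * running))
        running += loss_i
    return weights


def residual_loss(net, times, u0, eps):
    """Causally weighted mean of the per-time residuals over the time grid."""
    per_time = [sum(lorenz_residuals(net, t, u0)) for t in times]
    weights = causal_weights(per_time, eps)
    return sum(w * r for w, r in zip(weights, per_time)) / len(times)


def toy_net(t):
    """Hypothetical stand-in for the neural network u_theta."""
    return [math.sin(t), math.cos(t), t]


u0 = [1.0, 1.0, 1.0]                          # assumed initial condition
times = [0.5 * i / 255 for i in range(256)]   # uniform grid on [0, 0.5], Nt = 256
loss = residual_loss(toy_net, times, u0, eps=1.0)
```

Because the network output is multiplied by t, no separate initial-condition loss term is needed, which is why the training objective consists of the residual loss alone. In an actual training loop the weights would be treated as constants when differentiating the loss with respect to θ.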
+
+Figure 17: Lorenz system: Loss convergence of training a physics-informed neural network using Algorithm 1 for every time window. Each panel plots the residual loss Lr against training iterations for four consecutive time windows, covering t ∈ [0, 20].
+
+Figure 18: Lorenz system: Reference solutions versus the predicted solutions obtained by training a physics-informed neural network using Algorithm 1 with fixed iterations. Panels show the predicted x(t), y(t), z(t) and the corresponding absolute errors for t ∈ [0, 20].
+
+Figure 19: Lorenz system: Left: Loss convergence ($\mathcal{L}_r$ versus iteration, $\times 10^5$) of training a physics-informed neural network using Algorithm 1 for every time window, from $t \in [0.0, 0.5]$ to $t \in [19.5, 20.0]$.
+
+## F Kuramoto–Sivashinsky equation 
+
+Validation: For Case I (regular), we solve the Kuramoto–Sivashinsky equation using conventional spectral methods. Specifically, assuming periodic boundary conditions, we start from the initial condition $u_0(x) = -\sin(\pi x)$ and integrate Equation 5.4 up to the final time $T = 1$. Synthetic validation data are generated using the Chebfun package [60] with a spectral Fourier discretization with 512 modes and a fourth-order stiff time-stepping scheme (ETDRK4) [61] with time-step size $10^{-5}$. For Case II (chaotic), we perform the same procedure with the initial condition $u_0(x) = \cos(x)(1 + \sin(x))$. Then we select the numerical solution at $t = 0.5$ as our initial condition for the PINNs simulation.
+
+PINNs implementation: For Case I (regular), we split the temporal domain $[0, 1]$ into 10 time windows of size $\Delta t = 0.1$. Then we approximate the solution of each time window by a 5-layer modified MLP $u_\theta$ with 256 neurons per hidden layer and encoded periodicity. This allows us to define the PDE residual by
+
+$$\mathcal{R}[u_\theta] = \frac{\partial u_\theta}{\partial t} + \alpha u_\theta \frac{\partial u_\theta}{\partial x} + \beta \frac{\partial^2 u_\theta}{\partial x^2} + \gamma \frac{\partial^4 u_\theta}{\partial x^4}. \tag{F.1}$$
+
+Then, we can formulate the following loss function
+
+$$\mathcal{L}(\theta) = \frac{1}{N_t} \sum_{i=0}^{N_t} w_i \mathcal{L}(t_i, \theta), \tag{F.2}$$
+
+where
+
+$$\mathcal{L}(t_0, \theta) = \lambda_{ic} \frac{1}{N_x} \sum_{j=1}^{N_x} \left| u_\theta(0, x_j) - u_0(x_j) \right|^2, \tag{F.3}$$
+
+$$\mathcal{L}(t_i, \theta) = \frac{1}{N_x} \sum_{j=1}^{N_x} \left| \mathcal{R}[u_\theta](t_i, x_j) \right|^2, \quad \text{for } 1 \le i \le N_t. \tag{F.4}$$
+
+Here we set $N_t = 32$, $N_x = 64$, and $\{t_i\}_{i=1}^{N_t}$, $\{x_j\}_{j=1}^{N_x}$ are randomly sampled in $[0, \Delta t]$ and $[-1, 1]$, respectively, at each iteration of gradient descent. In particular, we take $\lambda_{ic} = 10^3$ to better enforce the initial condition. The network is trained by minimizing the above loss function via mini-batch gradient descent using the proposed algorithm.
+
+For Case II (chaotic), we split the temporal domain $[0, 0.5]$ into 5 time windows of size $\Delta t = 0.1$. Then we perform the same procedure, except that we employ a 10-layer modified MLP with 128 neurons per hidden layer and set $\lambda_{ic} = 10^4$.
+
+Remark: For both cases, we employ Taylor-mode automatic differentiation [47] to accelerate the computation of high-order derivatives (see Section 4).
+
+Figure 20: Kuramoto–Sivashinsky equation (regular): Loss convergence ($\mathcal{L}_{ic}$ and $\mathcal{L}_r$ versus iteration, $\times 10^5$) of training a physics-informed neural network using Algorithm 1 for every time window, from $t \in [0.0, 0.1]$ to $t \in [0.9, 1.0]$.
+
+Figure 21: Kuramoto–Sivashinsky equation (chaotic): Loss convergence ($\mathcal{L}_{ic}$ and $\mathcal{L}_r$ versus iteration, $\times 10^5$) of training a physics-informed neural network using Algorithm 1 for every time window, from $t \in [0.0, 0.1]$ to $t \in [0.4, 0.5]$.
+
+## G Navier-Stokes equation 
+
+Validation: We simulate two-dimensional decaying turbulence in a periodic box using the JAX-CFD [56] incompressible Navier-Stokes solver. A high-resolution validation data set is created by simulating an initial divergence-free velocity field with the given maximum velocity $v_{\max} = 5$. The flow is solved using a Fourier spectral collocation method on a $1024 \times 1024$ uniform mesh with a time step of $dt = 10^{-4}$ [56].
+
+PINNs implementation: Similar to the previous examples, the time domain $[0, 1]$ is decomposed into 10 time windows of size $\Delta t = 0.1$. We proceed by representing the velocity field by a 6-layer modified MLP with 128 neurons per hidden layer
+
+$$[t, x, y] \xrightarrow{\ u_\theta\ } [u_\theta, v_\theta]. \tag{G.1}$$
+
+Then the vorticity can be approximated by $w_\theta = \partial_x v_\theta - \partial_y u_\theta$ using automatic differentiation. Now we can define the PDE residuals by
+
+$$\mathcal{R}^w_\theta = \frac{\partial w_\theta}{\partial t} + u_\theta \frac{\partial w_\theta}{\partial x} + v_\theta \frac{\partial w_\theta}{\partial y} - \frac{1}{Re} \left( \frac{\partial^2 w_\theta}{\partial x^2} + \frac{\partial^2 w_\theta}{\partial y^2} \right), \tag{G.2}$$
+
+$$\mathcal{R}^c_\theta = \frac{\partial u_\theta}{\partial x} + \frac{\partial v_\theta}{\partial y}. \tag{G.3}$$
+
+This allows us to define the loss function by
+
+$$\mathcal{L}(\theta) = \frac{1}{N_t} \sum_{i=0}^{N_t} w_i \mathcal{L}(t_i, \theta), \tag{G.4}$$
+
+where
+
+$$\begin{aligned} \mathcal{L}(t_0, \theta) = \frac{\lambda_{ic}}{N_x} \sum_{j=1}^{N_x} \Big( & \left| u_\theta(0, x_j, y_j) - u_0(0, x_j, y_j) \right|^2 \\ & + \left| v_\theta(0, x_j, y_j) - v_0(0, x_j, y_j) \right|^2 \\ & + \left| w_\theta(0, x_j, y_j) - w_0(0, x_j, y_j) \right|^2 \Big) \end{aligned} \tag{G.5--G.7}$$
+
+and
+
+$$\mathcal{L}(t_i, \theta) = \frac{\lambda_w}{N_x} \sum_{j=1}^{N_x} \left| \mathcal{R}^w_\theta(t_i, x_j, y_j) \right|^2 + \frac{\lambda_c}{N_x} \sum_{j=1}^{N_x} \left| \mathcal{R}^c_\theta(t_i, x_j, y_j) \right|^2, \quad \text{for } 1 \le i \le N_t. \tag{G.8}$$
+
+For this example we set $N_t = 64$, $N_x = 512$ and $\lambda_w = 1$, $\lambda_c = 10^2$, $\lambda_{ic} = 10^4$. The temporal and spatial collocation points are randomly sampled from $[0, 1]$ and $[0, 2\pi]^2$, respectively. It is worth noting that we also enforce the initial velocity field $(u_0, v_0)$ as additional constraints for better convergence. This is not a severe restriction, since the velocity field can be obtained from the vorticity by solving the associated Poisson equation or from the network representation directly.
+
+Furthermore, in this appendix we also present our results simulating the turbulent flow up to $T = 2$. Figure 26 presents the visualizations of the predicted velocity and vorticity field at the final state. The predictive accuracy is quantified in Figure 27. Although the resulting relative $L^2$ error is above 10%, our model predictions seem to be qualitatively correct against the corresponding ground truth.
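+
+As a brief aside on why the loss pairs the vorticity residual (G.2) with the continuity residual (G.3), the following worked derivation may help. It is a standard textbook-style sketch rather than part of the source text, and it assumes the non-dimensional momentum equations with a pressure field $p$ (a symbol not used above) and Laplacian $\Delta = \partial_x^2 + \partial_y^2$.
+
+$$
+\begin{aligned}
+& \partial_t u + u\,\partial_x u + v\,\partial_y u = -\partial_x p + \tfrac{1}{Re}\,\Delta u, \\
+& \partial_t v + u\,\partial_x v + v\,\partial_y v = -\partial_y p + \tfrac{1}{Re}\,\Delta v, \\
+& \partial_x(\text{second}) - \partial_y(\text{first}): \quad \partial_t w + u\,\partial_x w + v\,\partial_y w + (\partial_x u + \partial_y v)\,w = \tfrac{1}{Re}\,\Delta w, \qquad w = \partial_x v - \partial_y u.
+\end{aligned}
+$$
+
+The pressure drops out because $\partial_x \partial_y p = \partial_y \partial_x p$, while the extra term $(\partial_x u + \partial_y v)\,w$ vanishes only when the velocity is divergence-free. Under this reading, (G.2) coincides with the vorticity equation only if (G.3) also holds, which is one plausible motivation for penalizing both residuals.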
+
+Figure 22: Navier-Stokes: Representative snapshots of the predicted $u$ against the ground truth at $t = 0.2, 0.4, 0.6, 0.8$.
+
+Figure 23: Navier-Stokes: Representative snapshots of the predicted $v$ against the ground truth at $t = 0.2, 0.4, 0.6, 0.8$.
+
+Figure 24: Navier-Stokes: Representative snapshots of the predicted $w$ against the ground truth at $t = 0.2, 0.4, 0.6, 0.8$.
+
+Figure 25: Navier-Stokes: Loss convergence (losses $\mathcal{L}_{u_0}$, $\mathcal{L}_{v_0}$, $\mathcal{L}_{w_0}$, $\mathcal{L}_{r_w}$, $\mathcal{L}_{r_c}$ versus iteration, $\times 10^5$) of training a physics-informed neural network using Algorithm 1 for every time window, from $t \in [0.0, 0.1]$ to $t \in [0.9, 1.0]$.
+
+Figure 26: Navier-Stokes: Predicted $u$, $v$, $w$ against the ground truth at $t = 2$.
+
+Figure 27: Navier-Stokes: Relative $L^2$ errors of $u(t, x, y)$, $v(t, x, y)$, $w(t, x, y)$, respectively, for $t \in [0, 2]$.
diff --git a/pinn/docs/evidence/wang2022_expert_guide_pinn.md b/pinn/docs/evidence/wang2022_expert_guide_pinn.md
new file mode 100644
index 0000000..c606647
--- /dev/null
+++ b/pinn/docs/evidence/wang2022_expert_guide_pinn.md
@@ -0,0 +1,1983 @@
+Title: 2208.09123v3.pdf
+
+URL Source: https://arxiv.org/pdf/2208.09123
+
+Published Time: Mon, 23 Jan 2023 17:11:52 GMT
+
+Number of Pages: 68
+
+Markdown Content:
+# IAN: Iterated Adaptive Neighborhoods for manifold learning and dimensionality estimation 
+
+## Luciano Dyballa¹ and Steven W. Zucker¹,²
+
+¹ Department of Computer Science, Yale University, New Haven, CT
+
+² Department of Biomedical Engineering, Yale University, New Haven, CT
+
+Abstract 
+
+Invoking the manifold assumption in machine learning requires knowledge of the manifold’s geometry and dimension, and theory dictates how many samples are required. However, in most applications the data are limited, sampling may not be uniform, and the manifold’s properties are unknown; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting with a locally-conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. In each step, a linear program yields minimal neighborhoods globally, and a volumetric statistic reveals neighbor outliers likely to violate manifold geometry. We apply our adaptive neighborhoods to non-linear dimensionality reduction, geodesic computation, and dimension estimation. A comparison against standard algorithms using, e.g., $k$-nearest neighbors, demonstrates the usefulness of our approach.
+
+Research supported by NIH Grant EY031059, NSF CRCNS Grant 1822598, and the Swartz Foundation. This project derived from problems in manifold inference involving neuroscience data. We thank G. Field and M. Stryker for motivating discussions. 
+
+# 1 Introduction 
+
+A starting point for many algorithms in data science, from clustering to manifold inference, is knowing the neighbor relationships among data points. Clustering, for example, often begins with a “$k$-nearest neighbor graph,” while manifold inference involves a kernel, i.e., a measure of similarity between data points. In the first case, the neighborhoods are local and discrete; in the second, they are global and continuous, with concentration of influence controlled by the kernel bandwidth, or scale. Such neighbor relationships are fundamental to defining a topology. Moreover, dimensionality may be estimated based on the rate of change in the density of points within a ball, that is, within a neighborhood, with respect to its radius. It is helpful when the number of data points is large, a requirement that grows with dimensionality; asymptotic analysis is often favored by theoreticians.
+
+> arXiv:2208.09123v3 [cs.LG] 7 Jan 2023
+
+In practice, we rarely have enough data points to satisfy asymptotic bounds. Nor are we given the precise number of neighbors, $k$, that each point should have. We often make the manifold assumption (that the data points are drawn randomly from, or near, a manifold) but rarely try to assess the basic properties of the manifold assumed by theorists: its dimensionality, sampling density, curvature, medial axis, or reach (defined in the next section). All of these could influence $k$.
+
+Instead, we rely on different visualization algorithms, such as Isomap, diffusion maps, t-SNE, and many others (references in the next section), to find a pleasing organization of the data. This is dangerous, of course, because these algorithms have free parameters. In particular, and central to this paper, most require specifying the number of neighbors, $k$ (or its equivalent): changing $k$ or other parameters changes the result. Unless one knows the answer, one is caught in a conundrum: imposing a prior belief amounts to “fixing” the solution (examples of changing $k$ are shown later in the paper).
+
+This gap between theory and practice shows up right from the start. If the manifold is not pure, i.e., if it consists of a union of manifolds of possibly different dimensionality, then there may be no global $k$ that suffices; furthermore, the manifold may have a boundary. Even if it is pure and without boundary, the temptation to choose $k$ large is common. But this can incorrectly fill in the open space around curved manifolds (“folding”, or “short-circuiting”), linking distant points that should not be neighbors. On the other hand, choosing $k$ small can induce holes and break connectivity. Such phenomena are illustrated in Figure 1. As we shall demonstrate, sampling issues and manifold geometry interact in causing these. Moreover, in real datasets the appropriate number of neighbors may differ from point to point. This final issue is a principal motivation for this paper.
+
+We present an algorithm to estimate an effective neighborhood (the immediate neighbors, or scale of a similarity kernel) around each point. We seek to identify those nearest neighbors that are “correct” in the sense that they support dimensionality and volume estimates, and manifold inference in general, without covering holes or filling in concavities. It is inspired by the philosophical position that views discrete and continuous mathematics as “two sides of the same,” as argued by Lovász [74], and iterates between them. Our algorithm builds from a conservative initial estimate of neighbors (based on a discrete construct, the Gabriel graph) toward a refined one, based on continuous estimates from a multiscale Gaussian kernel. The discrete and continuous volume estimates must be consistent, however, and this provides the glue for our iteration. Since not all of the initial putative neighbors may actually be closest neighbors, those neighbors that violate the volume relationship are pruned, and the process repeats until the two perspectives agree.
+
+Our algorithm, thus, can be considered an iterative graph sparsification. Technically, it involves two different graphs: a discrete one, that links only putative nearest neighbors (pairs of points defining the diameter of an otherwise empty ball), and a weighted one, structured by a multiscale Gaussian kernel, whose individual scales must cover the neighborhood given in the discrete graph. Keeping the two graphs consistent is another way to think about our iteration. Each resulting graph can be applied to many different algorithms for data visualization, dimensionality reduction, and manifold inference.
+
+Our approach to the problem is in the spirit of *exploratory data analysis*; it works with the available data. This provides another view regarding the interaction between sampling and geometry: one can only do as well as the available data allow (see Figure 2). The situation is analogous to that in learning theory, where there is a trade-off between the accuracy of the learner and the coarseness of the hypothesis class over which she is learning [2]. Here, the space of manifolds over which inferences are made is dictated by the available samples.
+
+Figure 1: Inferring the geometry of manifolds requires neighborhoods around each given data point. Setting the correct scale for these neighborhoods, shown as balls, is fundamental. (A) Example of a 1-dimensional manifold, $\mathcal{M}$. (B) Collection of points sampled from an unknown distribution over $\mathcal{M}$. Their pairwise distances are the only available data; properties of $\mathcal{M}$ are not given *a priori*. (C) Using a global kernel scale: if it is too small, the manifold will appear disconnected, artificially producing clusters. Notice how some balls do not touch. (D) If it is too big, the manifold may collapse, giving rise to incorrect geometry/topology. Notice how the balls overlap (covering dimension). (E) The use of local scales based on a global number of nearest neighbors (in this example, $k = 2$) is still susceptible to the problems above. (F) Our approach computes locally adaptive neighborhood sizes, resulting in scales that conform to the local geometry and sampling.
+
+Figure 2: Sampling a Swiss cheese: the available data constrain manifold complexity. Viewed from left to right, as the number of sample points increases, the apparent manifold goes from a plane to a plane with holes. The central panel shows the sampling density for which actual holes in the manifold become roughly distinguishable from holes due to sampling. The results of our algorithm on these examples are shown in Figure 23.
+
+An overview of the paper is as follows. In the next section, we review the background in some detail, covering both the zoo of similarity kernels that exist plus several relevant notions, such as the reach of a manifold, that are well studied in the theory literature. The discussion is organized to emphasize the centrality of scale, or neighborhood, in all of the references.
+
+In section 3 we provide an overview of our algorithm. It includes a brief sketch of both graphs we work with, plus the connection back to manifolds. Pseudocode for the algorithm is given in Algorithm 1, which also includes pointers to where each of its steps is developed. We then expand on the algorithm. In section 3.3 we study the Gabriel graph and putative neighbors. Two features are emphasized: scale-free neighborhoods and the relationship between node degree and local dimensionality. A structural criterion is revealed, showing how putative edges between neighbors fill “volumes” that block others from being neighbors. This graph serves as an initialization. It is then refined iteratively in several steps. First, continuous kernel scales are computed based on the discrete, putative neighbors. A linear program relaxation bridges local scales to a global cover, in which each node’s weighted degree is comparable to the number of its neighbors. In other words, each neighborhood radius should not cover too many outside points. If it does, then it indicates that the neighborhood itself should be refined. That is, some putative scales are likely wrong, in the sense that their neighborhood contains an extreme outlier. This leads directly to a volumetric statistic (section 3.5.1), and to a pruning technique for sparsifying edges from the discrete graph. The process iterates until there are no more outliers.
+
+In section 4, we evaluate the results for estimating manifold low-dimensional embeddings, geodesics, and local intrinsic dimensionality. Comparisons against popular algorithms, such as UMAP and t-SNE, illustrate the power of the approach. In the end, we demonstrate that it is possible to infer data-driven local neighborhoods that remain consistent with geometric and topological properties of manifolds. Code for our algorithm is available at github.com/dyballa/IAN.
+
+# 2 Background
+
+Manifold learning is a vast area of machine learning where high-dimensional data are analyzed based on the assumption that they were sampled from a low-dimensional manifold, $\mathcal{M}$ [42], in which case geodesic distances over $\mathcal{M}$ provide a better description of the relationships between data points than Euclidean distances in ambient space [11]. The manifold assumption finds applications in non-linear dimensionality reduction [102], de-noising [56], interpolation [23], dimensionality estimation [25], computational geometry [32], and more.
+
+Since $\mathcal{M}$ locally resembles Euclidean space, it is standard to use a similarity kernel to define (possibly weighted) neighborhoods around each point in terms of other points. This naturally leads to a graph having data points as nodes and similarity values as edge weights. Then, by computing the graph Laplacian, one can apply a variety of methods from spectral graph theory [see, e.g., 94]. Formal analysis involves the limit as the number of data points grows large; the practical success of such methods depends on how well graph neighborhoods capture the topology and geometry of $\mathcal{M}$.
+
+We here review the many approaches to specifying a similarity kernel or a local neighborhood. Let $\mathcal{M}$ be a $d$-dimensional manifold in ambient space $\mathbb{R}^n$. When only pairwise distances are known, an intuitive approach is to define the neighbors of a point, $x_i$, as those within a certain distance threshold, or, equivalently, inside an $n$-dimensional ball around $x_i$. A kernel function assumes the role of this ball, by assigning values to neighboring points as a function (discrete or continuous) of how close they are to $x_i$. The question becomes: what kernel size should be used for each point?
+
+## 2.1 Similarity kernels 
+
+Consider a set of points $X \subset \mathbb{R}^n$. Typically, a symmetric, positive semi-definite similarity kernel [93] is chosen to determine weighted connections between data points based on the ambient Euclidean distances between them. For each pair of data points $x_i, x_j \in \mathbb{R}^n$, it returns a number between 0 and 1 which determines how close, or strongly connected, they are. This effectively defines a neighborhood around each point.
+
+### 2.1.1 Discrete kernels
+
+Possibly the simplest choice for a kernel is the $\varepsilon$-neighborhood [e.g., 10]:
+
+$$K_{ij}(\varepsilon) = \begin{cases} 1, & \text{if } \|x_i - x_j\| < \varepsilon \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$
+
+where $\|\cdot\|$ is typically the Euclidean norm in $\mathbb{R}^n$. This results in discrete-like neighborhoods whose sizes may be quite sensitive to the choice of $\varepsilon$, so implicit is the assumption that sampling is approximately uniform.
+
+Instead of defining a neighborhood radius, a more common approach is to specify the number of neighboring points, $k$. Letting $\mathcal{N}_k(x_i)$ be the set containing the $k$ points closest to $x_i$ in $\mathbb{R}^n$ (not including $x_i$)¹, a $k$-nearest neighbors kernel can be defined as:
+
+$$K_{ij}(k) = \begin{cases} 1, & \text{if } x_j \in \mathcal{N}_k(x_i) \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
+
+which is commonly symmetrized by making $K_{ij}(k) = 1$ if $x_j \in \mathcal{N}_k(x_i) \lor x_i \in \mathcal{N}_k(x_j)$.
+
+### 2.1.2 Continuous kernels – global scale
+
+In order to have the kernel values decrease with increasing distance between data points, a Gaussian kernel is commonly used:
+
+$$K_{ij}(\sigma) = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right). \tag{3}$$
+
+This gives a continuous similarity scale from 1 (when $x_i$ and $x_j$ are identical) down to some predetermined cutoff below which the kernel is considered to be zero (meaning no connection in the data graph). Such a threshold is typically chosen to be a very small value, often at the limit of numerical precision, and is often required to ensure compactness of the kernel.
+
+One would like the parameter $\sigma$ to be just large enough to be able to capture local manifold patches. There are several heuristics for finding such a scale: the median of all pairwise distances in $X$ (or another percentile), the mean (or median) of the distances to each point’s $k$th nearest neighbor [66], or a scalar multiple of the maximal distance from a point to its nearest neighbor in the data [59]. Also common is to choose a scale so that each data point is sufficiently connected to at least one other point [67].
+
+A different approach is based on inspection of the curve given by the sum of pairwise kernel values. When the double sum $\sum_{i,j} K_{ij}(\sigma)$ is plotted against $\sigma$ using a log-log scale, the slope
+
+$$\frac{\mathrm{d} \log \sum_{i,j} K_{ij}(\sigma)}{\mathrm{d} \log \sigma} \tag{4}$$
+
+is proportional to the intrinsic dimensionality of the data [30]. A global scale is then chosen from within a linear region of such a curve. In [52], a similar procedure is proposed that considers, instead, the curve given by the weighted average of the degrees $Z_i(\sigma) = \sum_j K_{ij}$ of each data point $x_i$, after taking the logarithm:
+
+$$\langle \log Z_i(\sigma) \rangle = \frac{\sum_i \log Z_i(\sigma) \cdot \left( 1 / Z_i(\sigma) \right)}{\sum_i \left( 1 / Z_i(\sigma) \right)}. \tag{5}$$
+
+The use of the inverse of each point’s degree as weights is intended to compensate for density heterogeneities. The choice of $\sigma$ is then made precise by choosing the argmax of the slope of $\langle \log Z_i(\sigma) \rangle$ plotted against $\log \sigma$, which in many cases should occur near the center of the linear region of equation 4. One complication occurring in both approaches, however, is that more than one linear section (and, equivalently, more than one local maximum of the slope) may exist, requiring that additional criteria be defined to make the choice of $\sigma$ truly automated.
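+
+To see why the slope in equation 4 tracks dimensionality, a rough worked estimate can help. It is a sketch under strong assumptions that are not in the source: $N$ points sampled uniformly from a flat $d$-dimensional patch of volume $V$ (both hypothetical symbols), and $\sigma$ in an intermediate range, well above the typical nearest-neighbor distance and well below the patch size.
+
+$$
+\sum_{i,j} K_{ij}(\sigma) \;\approx\; \frac{N^2}{V} \int_{\mathbb{R}^d} \exp\left( -\frac{\|z\|^2}{\sigma^2} \right) \mathrm{d}z \;=\; \frac{N^2}{V}\, \pi^{d/2} \sigma^{d}
+\quad \Longrightarrow \quad
+\frac{\mathrm{d} \log \sum_{i,j} K_{ij}(\sigma)}{\mathrm{d} \log \sigma} \;\approx\; d.
+$$
+
+For very small $\sigma$ only the diagonal terms survive and the sum flattens at $N$; for very large $\sigma$ it saturates at $N^2$. Both regimes give a slope near zero, which is consistent with reading the scale off the linear region in between. If the kernel were parameterized by $\sigma^2$ instead of $\sigma$, the same argument would give a slope of $d/2$.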
+
+> ¹ Throughout, when referring to a point’s set of $k$-nearest neighbors, we shall not include the point itself (unless otherwise stated), and further assume that no two points are identical.
+
+### 2.1.3 Continuous kernels – multiscale
+
+A more localized strategy is to use a *multiscale kernel*, where each point has an individual scale, or bandwidth. Instead of a single, global scale, there are now $N$ parameters. The advantage is that, if the scale selection is adequate, the kernel may capture the characteristics of more complex datasets and manifolds that have non-uniform sampling and geometry.
+
+In the self-tuning method [111], local scales are used in a Gaussian kernel by replacing the global scale $\sigma$, from equation 3, by $\sqrt{\sigma_i \sigma_j}$, where $\sigma_i$ and $\sigma_j$ are the scales assigned to $x_i$ and $x_j$, respectively. This results in the symmetric kernel:
+
+$$K_{ij}(\sigma_i, \sigma_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j} \right). \tag{6}$$
+
+Each $\sigma_i$ is set as the distance to the $k$th nearest neighbor of $x_i$; the authors recommend $k = 7$ [111, 79].
+
+In [17] and [16], a variable bandwidth kernel is proposed that combines the use of local bandwidths with a global scale parameter, $\varepsilon$. The kernel then takes the form:
+
+$$K_\varepsilon(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{4 \varepsilon \left( q(x_i) q(x_j) \right)^{\beta}} \right), \tag{7}$$
+
+where $q$ is a local density function and $\beta$ an additional (non-positive) parameter. An initial estimate for the local bandwidth around each point $x_i$ is set as the square root of the mean squared distance to the $k$-nearest neighbors of $x_i$, with $k = 8$. Finally, $\varepsilon$ is automatically tuned as the argmax of equation 4 above; however, the authors do not consider cases in which more than one local maximum may exist.
+
+Other methods also adopt individual bandwidth parameters, but use asymmetric kernels that are symmetrized *a posteriori*. In the t-SNE algorithm [101], the single-scale Gaussian kernel
+
+$$K_{ij}(\sigma_i) = \exp\left( -\frac{\|x_i - x_j\|^2}{2 \sigma_i^2} \right) \tag{8}$$
+
+gives a measure of affinity, or similarity, between pairs of points. It is then normalized as
+
+$$p_{j|i}(\sigma_i) = \frac{K_{ij}(\sigma_i)}{\sum_{k \neq i} K_{ik}(\sigma_i)} \tag{9}$$
+
+to yield transition probabilities, and finally symmetrized as
+
+$$p_{ij}(\sigma_i, \sigma_j) = \frac{1}{2N} \left( p_{j|i}(\sigma_i) + p_{i|j}(\sigma_j) \right). \tag{10}$$
+
+Each $\sigma_i$ is fit to $x_i$ so that the distribution of $p_{j|i}$, $\forall j$, attains entropy $H_i$ such that its perplexity, $2^{H_i}$ (a real-valued number representing the “effective number of neighbors”), approximates some prespecified value, $k$. The authors recommend a value for $k$ between 5 and 50.
+
+In the UMAP algorithm [77], an exponential kernel is used instead of the typical Gaussian. Using a prespecified neighborhood size, $k$, let $\mathcal{N}_k(i)$ be the set of $k$-nearest neighbors of $x_i$. With $\rho_i$ as the distance to the nearest neighbor of $x_i$, the kernel has the form
+
+$$K_{ij}(\sigma_i) = \exp\left( \frac{-\max\{0, \|x_i - x_j\| - \rho_i\}}{\sigma_i} \right), \quad j \in \mathcal{N}_k(i), \tag{11}$$
+
+and is symmetrized as
+
+$$U_{ij}(\sigma_i, \sigma_j) = K_{ij}(\sigma_i) + K_{ji}(\sigma_j) - K_{ij}(\sigma_i) K_{ji}(\sigma_j). \tag{12}$$
+
+It can be seen as a hybrid between continuous and discrete, since $U_{il}$ is set to zero for any point $x_l$ not in $\mathcal{N}_k(i)$. Each $\sigma_i$ is fit to $x_i$ so that $\sum_j K_{ij}(\sigma_i)$ approximates $\log_2 k$ (loosely analogous to the perplexity approach from t-SNE).
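+
+Two small worked cases may make the quantities above more concrete. They are illustrative only: the uniform distribution and the numerical values are assumptions chosen for easy arithmetic, not taken from the source, and $H_i$ is taken to be the Shannon entropy in bits.
+
+$$
+\begin{aligned}
+& \text{t-SNE: if } p_{j|i} = \tfrac{1}{k} \text{ for exactly } k \text{ points and } 0 \text{ otherwise, then } H_i = -\sum_{j} p_{j|i} \log_2 p_{j|i} = \log_2 k, \quad 2^{H_i} = k. \\
+& \text{UMAP: if } K_{ij}(\sigma_i) = 0.8 \text{ and } K_{ji}(\sigma_j) = 0.5, \text{ then } U_{ij} = 0.8 + 0.5 - 0.8 \cdot 0.5 = 0.9.
+\end{aligned}
+$$
+
+The first case is the sense in which perplexity acts as an effective number of neighbors: a sharply peaked $p_{j|i}$ gives a smaller value, a flatter one a larger value. The second shows that the symmetrization in equation 12 behaves like a probabilistic union, since $U_{ij} = 1 - (1 - K_{ij})(1 - K_{ji})$ is at least as large as either directed weight and never exceeds 1.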
+
+### 2.1.4 Adaptive neighborhood size methods
+
+Other methods attempt to automatically determine optimal neighborhoods. Most of these are based on determining an optimal $k$ for a $k$-nearest neighbors ($k$-NN) graph; this can be done either globally or by selecting a local neighborhood size $k_i$ around each point $x_i$, known as adaptive neighborhood selection [102]. Some approaches optimize a global $k$ based on its performance in a specific embedding algorithm. For instance, the method from [91] is tailored to Isomap [96], while others [65, 3] apply to LLE [88]. In [3], a local method is additionally proposed that produces a nearest-neighbor graph with variable $k_i$, under the assumption that the manifold is connected. Others are based on first estimating the local tangent space around each point, then setting $k_i$ to include as neighbors those points that are close to it. Such methods [e.g., 106, 78] typically work with positional information for the tangent space computation (usually via SVD).
+
+Also available are methods that are not based on the nearest-neighbors concept. In computational geometry, the idea of refining an initial estimate of connectivity from a simplicial mesh has been used before, usually specific to the case when $d = 2$ and $n = 3$, i.e., surfaces in 3-D space [5, 4, 14, 12]. Other approaches extend this idea to arbitrary dimension [13, 19], but still require knowledge of $d$. Most of the algorithms in this class use point clouds as input, so they can exploit positional information to decide on the appropriate neighborhood/connectivity.
+
+Among the myriad ways of estimating neighborhoods, there is little agreement on which is most successful; see [71] for a review. Before proceeding to our algorithm, then, it is helpful to first understand what makes this such a hard problem. How can it fail, and what requirements must it fulfill in order to properly capture the topology and geometry of $\mathcal{M}$? This brings us to the geometry of manifolds.
+
+## 2.2 Reach and the geometry of manifolds 
+
+The neighborhoods implied by a kernel should agree with $\mathcal{M}$, or at least approximate a tubular neighborhood of it. As exemplified in Figure 1, if neighborhoods are too small, the implied manifold may become disconnected, i.e., falsely divided into disjoint sub-manifolds or clusters [91]; if too large, they may cause $\mathcal{M}$ to self-intersect, collapsing bottlenecks or curved regions, or cause “smoothing,” or “folding.” Such shortcomings are well known in the manifold inference literature: while the former case typically occurs due to non-uniform sampling, the latter is mainly caused by an incompatibility between the sampling rate and the reach of $\mathcal{M}$ [41, 97]. We now expand on these points.
+
+Letting the medial axis of $\mathcal{M}$ be the set of points in $\mathbb{R}^n$ with at least two closest points in $\mathcal{M}$, the reach, $\tau$, can be defined as the minimum distance from $\mathcal{M}$ to its medial axis. Locally, it is constrained by the minimal radius of curvature (i.e., maximal curvature of a geodesic through $\mathcal{M}$); globally, it is constrained by the presence of bottlenecks (Figure 3). The reach encodes essential geometric properties of $\mathcal{M}$, and has been widely used in the manifold learning community [4, 12, 82, 19, 83, 49, 73, 43, 1, 20]. It approximates the size of the largest ball in ambient $\mathbb{R}^n$ such that points in $\mathcal{M}$ can be seen as lying in Euclidean space $\mathbb{R}^d$ [18]. A related concept, the local feature size of a point $x_i \in \mathcal{M}$, is the smallest distance between $x_i$ and the medial axis of $\mathcal{M}$, so $\tau$ can be seen as the infimum of the local feature size anywhere on $\mathcal{M}$ [13].
+
+Figure 3: The reach, τ, is a measure of the shape of a manifold. (A) A 1-dimensional manifold M with a bottleneck; the reach (double arrow) is the smallest distance between M and its medial axis (dashed curves). (B) A highly curved manifold; now the reach indicates the high curvature region.
+
+When τ is positive, it provides a measure of the “local distortion” [18]; the larger it is, the easier inference becomes. Some authors [e.g., 81, 42] assume large reach in order to test the manifold hypothesis and to find bounds on the required sample size. In [18], the reach is used when establishing bounds on the quality of an intrinsic dimensionality estimation based on k-nearest neighbors. Obtaining a good representation of M, therefore, requires consideration of its reach. In terms of our problem of finding an appropriate kernel, this effectively means that no neighborhood radius should cross the medial axis of M.
+
+Sampling is a further complication, and essentially what makes this a hard problem: when it is nonuniform and sparse (common in real-life datasets), it is not always clear whether the space between points constitutes an undersampled piece of M, a hole, or a gap between disjoint submanifolds (cf. Figure 2). The latter two conditions, of course, relate to reach. Narayanan and Mitter [81] prove that the number of required samples depends polynomially on curvature, exponentially on intrinsic dimension, and linearly on intrinsic volume. Aspects of our algorithm address each of these during the iteration process.
+
+In all such cases, choosing a globally-fixed radius is likely to be problematic. While defining neighborhood size based on a fixed number k of neighbors can be helpful to deal with nonuniform density (since the neighborhood radius adapts to the local pairwise distances), it is bound to violate the reach if k is too large. It will also be a problem when the intrinsic dimensionality is not constant throughout M, as higher dimensions require exponentially more neighbors. Mekuz and Tsotsos [78] point out the lack of a principled way for setting this parameter, which in practice is often tuned empirically based on prior knowledge of the desired output. As put by Wang et al. [106], the effectiveness of manifold learning algorithms depends on how nearby neighborhoods overlap and on the interplay between the curvature of the manifold and sampling density. In terms relevant to this paper, the neighborhood radius should be smaller than the local feature size, but large enough to account for sampling variability and local dimensionality. We propose an iterative approach to developing a kernel, so that it can adapt appropriately to the neighborhood characteristics around each point.
+
+# 3 The algorithm 
+
+We here overview our algorithm for finding the neighborhood scale around each point in a manner that makes it globally consistent as a covering of the data points. As is common in manifold learning, we start with a pairwise distance matrix, not the points themselves. The first step is to build a graph in which each datum is connected to an appropriate neighborhood containing other data points. This data graph defines a topology; we refer to it as the neighborhood graph.
+
+As we reviewed above, in the discrete case one might choose k-nearest neighbors, while in the continuous kernel case there is a bandwidth parameter that effectively defines a “ball of influence” around each point. Scale is the radius of such a ball; a level set of the kernel function that essentially contains those neighbors whose weights are non-trivial. Our goal, then, is to find those scales (or neighborhoods) that support non-linear dimensionality reduction, geodesic estimation and, in general, manifold inference from the given pairwise distances. We do not have sampling guarantees, so will develop a statistic to check whether reach and curvature constraints might be violated.
+
+## 3.1 Subtleties of scale 
+
+Since scale may not be constant across the data set, we argue that it should be the first property to be inferred from the data. We start by imposing the manifold assumption, but from an empirical perspective. Unlike most theoretical studies, we do not assume the manifold is pure, i.e., that it has constant dimension. In a simple case, the data may be drawn from a union of different manifolds whose dimensions are not known a priori; such datasets have been considered infrequently, although exceptions exist [e.g., 53, 73]. Second, we do not know the sampling rate, or density. Rather, we build it up, conservatively, with putative nearest neighbors to each data point, by imposing a necessary (but not sufficient) condition. These putative neighbors will be refined, as the algorithm iterates, to achieve sufficiency. While the manifold assumption does imply the existence of local neighborhoods, their size may vary over the dataset; we require that the sampling be nearly constant over each of them. In effect, the density of points must be determined locally while respecting the global manifold geometry.
+
+We illustrate the complexity of this situation in Figure 4. Shown is a data sphere with an apparent spike emerging from it. On one hand, such complex datasets could derive from two unrelated systems, which only appear to connect through their embeddings. On the other hand, the data could derive from a non-linear system that includes two regimes, one responsible for the spherical data and the other for the spike. To handle the first situation, we must allow datasets to consist of unions of manifolds. This suggests the interpretation in Figure 4-B, where the separation is obscured by sampling. Since manifolds with boundary and high curvature are also possible, the situation in Figure 4-C arises. There is an apparent change in intrinsic dimension due to the small reach in the spike and the large boundary curvature. Because the (3-D) spike is so narrow, sampling suggests it is 1-dimensional, while the bulk of the points derive from a 3-D manifold.
+
+We submit that such situations occur in real datasets and, since the data are fixed, we cannot appeal to knowing the sampling density or the manifold dimensions and reach. Instead, we address the interplay between manifold reach and sampling density pragmatically. Along the spike, the data appear to be 1-D; in the ball, 3-D. We seek a neighborhood graph that supports these inferences, so “most” points enjoy a neighborhood that agrees with their apparent dimension. At the join (or high-curvature neck), it is unclear. Moving from the spike to the ball suggests that dimension should be increasing; from the ball to the spike, it should be decreasing. For the neighborhood graph, most points along the spike should see ∼2 neighbors, and most points in the ball should see ∼2^3 neighbors; the problematic points should see something intermediate. Such results will be shown to follow from our algorithm.
+
+We claim that either of the alternatives is worse; one should not impose an apparent dimensionality (or connectivity in the neighborhood graph) globally. To wit, if small numbers of neighbors (appropriate for the spike) are enforced on the ball, then holes are likely to be introduced. Or, if too many neighbors are enforced on the spike, it will collapse on itself. Both change the topology drastically (these situations are illustrated later, in Figures 24–25).
+
+Figure 4: The “manifold” subtleties of complex datasets. (A) Sampled data from a non-linear system that includes two regimes. (B) It may be the case that the data in each regime define separate manifolds, shown by color. After sampling their union, however, the evidence for the separation is absent. (C) Or the data may be drawn from a single, connected manifold whose geometric properties change rapidly. In both cases, the intrinsic dimensionality appears different in the spike vs. the ball. Colored meshes indicate underlying manifolds. 
+
+## 3.2 Overview of the algorithm 
+
+Let the dataset, X, be a sampling of a (possibly non-pure) manifold M = ∪αMα, with the dimension of each component Mα denoted by dα. It consists of N points in ambient space Rn, where n ≥ dα, ∀α. The manifold may have a boundary, and the number of components is not known a priori.
+
+We work with two graphs: the first unweighted, and the second with edge weights given by a kernel. Our strategy is to begin with a conservative estimate of the unweighted graph, and extend it to a global weighted graph that suggests an estimated manifold covering. The validity of this extension is evaluated by a measure of volume in both graphs; an iterative algorithm is used to infer individual local scales for each point xi.
+
+Before presenting the algorithm, we introduce the two graphs. Let the unweighted graph be G = (V, E), with |V| = N and adjacency matrix A with entries aij, where to each point xi ∈ X is associated a node i ∈ V. We denote its initial estimate by G(0); successive refinements are indicated as G(t) until convergence (G⋆). Since we seek a scale for each data point, we work with a multiscale Gaussian similarity kernel, defined as in section 2.1:
+
+$$K_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{\sigma_i \sigma_j}\right). \tag{13}$$
+
+The kernel value Kij is therefore symmetric and equivalent to that of a traditional Gaussian kernel (equation 3), except using the geometric mean of σi and σj as its scale. Notice, in particular, how the scales and the kernel value are coupled: setting the scale incorrectly could make distant points xi and xj appear close in similarity. Given a set of individual point scales σi (sometimes collected into the vector σ ∈ RN), we define a second, weighted graph 𝒢 = (V, E, W) as the complete graph on all pairs of data points in X. Its weighted adjacency matrix, W, has entries wij = Kij.
+
+While the unweighted graph will be related to nearest neighbors and computational geometry, the weighted graph will be related to spectral methods on manifold inference. In particular, we expect the Laplacian of 𝒢 to approximate the Laplace-Beltrami operator on M, subject to the number of data points and their sampling.
+
+The algorithm is initialized by computing a coarse estimate of G. As described later in section 3.3, this is achieved by exploiting the geometry of medial balls between pairs of points to produce a Gabriel graph [48, 76]. A Gabriel graph is one in which there is an edge between two points xi and xj if and only if they are the only two closest points to the midpoint of the line segment joining them. The main advantages of using a Gabriel graph as a starting point are: (i) it is scale invariant, so a prespecified ε-neighborhood (equation 1) is not required; (ii) there is no global constant k (it can vary); and (iii) neighbors are not limited to the closest neighbors in ambient space. Thus, it allows for connections to “jump across” sampling gaps while keeping the data graph sparse.
+
+However, as described in section 2.2, obtaining a good inference of M amounts to finding reasonable estimates of its reach and local feature size. For that to occur, no edge segment ℓij between two points xi and xj should cross a medial axis of M. As the examples that follow will show, there are several cases in which the Gabriel graph will violate this. Therefore, additional steps are necessary to refine it. The Gabriel graph provides a necessary condition (all the correct connections are present, but possibly others as well); our refinement moves toward sufficiency.
+
+In order to estimate 𝒢, the weighted counterpart of G, we will use the weights that are obtained by applying a continuous kernel (equation 13) over the points in X. Such a kernel requires scales, or bandwidths, σ that must be estimated from G. These will be obtained from an optimization procedure that finds the smallest such scales ensuring that all discrete edges have a minimum kernel value as weight. At this point, a weighted graph 𝒢 can be obtained from σ.
+
+It is now helpful to articulate the geometry more carefully; Figure 5 depicts how the discrete connectivity relates to the manifold geometry. In particular, for a real dataset, the few closest points surrounding xi are the best candidates for “nearest” neighbors; this is all that can be asserted locally. Let pi and pj be the projections of two neighbors xi and xj onto M, respectively. Then, any point along the geodesic between pi and pj should be closer to no sampled point other than xi or xj. By further assuming xi ∈ M, ∀i, or at least that ‖xi − M‖Rn < ε, ∀i and small ε, then ‖xi − xj‖Rn approximates the geodesic when the curvature between pi and pj is small. Equivalently, the line segment ℓij between xi and xj lies on the tangent space TpM, where p is the midpoint between pi and pj; see Figure 5. The existence of a geodesic follows from identifying the tangent plane that includes the points with the exponential map of the manifold around them.
+
+Such an “edge-centric” approach connects differential geometry to the underlying graph. This is illustrated in Figure 5, where the kernel values are shown as shading in the tangent plane. Notice how xi and its neighbor xj both fall under the bright kernel values; i.e., they are very similar (in this measure) to each other. Stated in geometric terms, we assume that the neighbors lie within the injectivity radius around p. In fact, we will show (Figure 14) that the value of a multiscale kernel between two data points is equivalent to that of a rescaled, single-scale kernel centered at the midpoint between those two points.
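+
+The multiscale kernel of equation 13 can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function name `multiscale_kernel` and the toy scales are assumptions, and the scales are set by hand rather than optimized. It shows the coupling noted above: a large scale on one point makes a distant pair look as similar as a close pair.
+
```python
import numpy as np


def multiscale_kernel(D, sigma):
    """K_ij = exp(-D_ij^2 / (sigma_i * sigma_j)) for a pairwise distance matrix D."""
    D = np.asarray(D, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # The outer product gives sigma_i * sigma_j for every pair, so K is symmetric.
    return np.exp(-(D ** 2) / np.outer(sigma, sigma))


# Hypothetical toy data: three collinear points at positions 0, 1, and 3.
positions = np.array([0.0, 1.0, 3.0])
D = np.abs(positions[:, None] - positions[None, :])

# Hand-picked scales (an assumption); the third point gets a much larger scale.
sigma = np.array([1.0, 1.0, 4.0])
K = multiscale_kernel(D, sigma)
```
+
+Here points 1 and 2 are twice as far apart as points 0 and 1, yet both pairs receive the same kernel value because σ for the third point is large.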
+
+Figure 5: Relating the discrete neighborhood graph to manifold geometry. Nearby sampled points (i and j) on a patch of manifold M lie in (or near) the tangent plane TpM to the midpoint (p). Line segments (edges) between neighboring points lift, via the exponential map, to geodesics in M. The continuous kernel extends this discrete relationship to the full tangent plane. The values of the kernel centered at p are shown as shading, extending in every direction in TpM. Our algorithm shall enforce this relationship, i.e., the consistency between discrete edges and large kernel values.
+
+Now, the optimized scales can be used to evaluate the current approximation and identify the edges in G that are “too expensive,” i.e., are likely to violate the local feature size. We proceed by computing successive refinements of both G and σ, in an iterative manner, until no further change is observed. We then return the final version of the discrete and weighted graphs (denoted by G⋆ and 𝒢⋆, respectively).
+
+One can view the computation of 𝒢 as a relaxation of the discrete connectivity in G. In fact, as we shall see in section 3.5, a relaxation statistic, δ′i, will be used to prune discrete edges that produce a poor approximation. More specifically, when a node i with degree deg(i) in G has δ′i close to 1, it means i has retained approximately the same degree in 𝒢, only continuously spread as a Gaussian around it.
+
+Each of the steps above is listed in Algorithm 1 and will be described in detail. We begin with the discrete connectivity rule (Gabriel graph); then the scale optimization is developed, followed by the edge-pruning step. Figure 6 illustrates the results of our algorithm on datasets for which the Gabriel graph alone cannot infer a good approximation of the manifold connectivity.
+
+Algorithm 1: Iterated Adaptive Neighborhoods kernel
+
+1. procedure IAN KERNEL(D) ▷ Input: distance matrix, D
+2.   G(0) ← GABRIEL GRAPH(D) ▷ Compute initial G (sec. 3.3)
+3.   repeat ▷ Iteration
+4.     σ(t) ← OPTIMIZE SCALES(G(t), D) ▷ Update scales σ (sec. 3.4)
+5.     𝒢(t) ← MULTISCALE KERNEL(D, σ(t)) ▷ Weighted graph (eq. 13)
+6.     δ′, C ← COMPUTE VOLUME RATIOS(G(t), σ(t)) ▷ Statistic δ′ (sec. 3.5.1)
+7.     G(t+1) ← SPARSIFY(G(t), δ′) ▷ Update G (sec. 3.5.2)
+8.   until no further change in G
+9.   return G⋆, 𝒢⋆, σ⋆ ▷ Output: final graphs and optimal scales
+10. end procedure
+
+## 3.3 Neighbors in a Gabriel graph 
+
+We begin by defining a set of putative neighboring points of xi (denoted by N (i)), which uses the connectivity rule found in a Gabriel graph [48, 76]. It directly incorporates the observation that closest neighbors should have no points “between” them. 
+
+Remark 1. Two points, xi and xj, are Gabriel-nearest neighbors to each other if and only if they both touch the same closed ball, Bij, that is empty except for xi and xj.
+
+Note that Bij is therefore a medial ball, i.e., a ball whose center lies on the medial axis (with respect to the set of sampled points). Thus, this connectivity criterion can be restated as creating an edge for all those medial balls, and only those, touching exclusively two points (to be clear, if a third point touches Bij no edge shall be formed between xi and xj). Hence, to each edge eij is associated a medial ball Bij centered at the midpoint between xi and xj with radius ‖xi − xj‖/2 (see Figure 7). This is furthermore equivalent to the following alternative definitions:
+
+Remark 2. Points xi and xj are Gabriel-nearest neighbors if and only if any point along the line segment ℓij = xixj in Rn has either xi or xj (or both) as its only closest point(s).
+
+Remark 3. In terms of the Voronoi diagram [44] of X (with the cell around xi denoted by Vi), xi and xj are neighbors when ℓij crosses a single Voronoi hyperplane Hij (namely that between the cells Vi and Vj) and the midpoint between xi and xj is in Hij.
+
+Figure 6: Steps of Algorithm 1 on toy datasets. (A) Dataset with several challenges: non-uniform density, non-uniform dimension, and high curvature. After pruning 6 edges (dashed red lines) from the original Gabriel graph, G(0), the algorithm converges, inferring reasonable discrete neighborhoods (G⋆); the optimal scales σ⋆ produce a weighted graph 𝒢⋆ whose connectivity closely approximates that of G⋆. (B) Dataset with three Gaussian clusters of non-uniform density. The Gabriel graph approximation, G(0), naively connects all clusters using multiple edges. After convergence, the clusters become disconnected in G⋆, and its weighted version follows this by assigning negligible weights (due to σ⋆) between points in different clusters.
+
+As a concrete example (refer to Figure 7), consider two points xi and xj at a distance rij from each other, with midpoint p. Assume the region in the manifold between them is uniformly sampled. Now consider the ball centered at p with radius rij/2, therefore touching xi and xj. If there are no points in its interior, we say xi and xj are nearest neighbors. Conversely, if it contains other points in its interior, under our assumption of uniform density this means that there is at least one other point xk “between” xi and xj. So we say that xi and xj are not nearest neighbors, in the sense that connecting xi and xj directly would be “crossing over” xk; this implies that an edge eij in the resulting graph would be a poor approximation to a geodesic in M (i.e., if M is “locally uniformly sampled,” the segment ℓij would be passing outside of M). Note that, even when the input to the algorithm is solely a distance matrix (i.e., with no position information), this connectivity criterion may still be evaluated by considering the triangle xi–xj–xk and using Apollonius’s theorem to compute the length of the median from xk to p (Figure 7-D).
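+
+The connectivity test described above needs only pairwise distances. The following is a minimal O(N^3) sketch under that assumption; the function name `gabriel_graph` and the three-point example are illustrative and not taken from the paper. A third point blocks the edge when its median length, given by Apollonius's theorem, does not exceed the ball radius.
+
```python
import numpy as np


def gabriel_graph(D):
    """Boolean adjacency matrix of the Gabriel graph, from pairwise distances only."""
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            # Apollonius: squared distance from every x_k to the midpoint of x_i and x_j.
            m2 = 0.25 * (2.0 * D2[i] + 2.0 * D2[j] - D2[i, j])
            m2[[i, j]] = np.inf  # the two endpoints lie on the ball by construction
            # Edge only if no third point falls in the closed ball of radius r_ij / 2.
            A[i, j] = A[j, i] = bool(np.all(m2 > 0.25 * D2[i, j]))
    return A


# Hypothetical example: x_2 sits "between" x_0 and x_1, so the edge (0, 1) is blocked.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.5]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = gabriel_graph(D)
```
+
+The coordinates are used here only to build D; the graph construction itself never touches them.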
+
+Figure 7: Connecting “nearest neighbors.” (A) A set of data points in space. (B) An edge can be formed between xi and xj because there is no other point in the interior of the ball Bij centered halfway between xi and xj. (C) Here, because of the presence of a third point xk inside Bij, xi and xj cannot be neighbors. (D) Even in the absence of the original data point coordinates, i.e., given only the distances between all pairs of points, Apollonius’s formula can be used to determine the length of the segment p–xk, where p is the center of Bij. Namely, p–xk is a median of the depicted triangle. Here, because the length of the median is less than the radius of Bij, xi and xj cannot be neighbors. (E) Edges are drawn connecting points xi to xk and xk to xj because both Bik and Bjk are empty except for those pairs of points, respectively.
+
+The Gabriel graph is a subgraph of the Delaunay graph [37], and enjoys a number of key properties [76]. We emphasize: (i) Gabriel graphs are scale invariant, i.e., there is no pre-specified threshold on the diameter of medial balls that can form connections; (ii) Gabriel graphs are guaranteed to connect points to their true nearest neighbors when M is uniformly sampled as a grid (shown in Figure 9); and (iii) Gabriel graphs provide a locally-adapted neighborhood size ki, for each point xi, based on the local geometry. Crucially, they do not require an initial guess of the number of neighbors, of the intrinsic dimensionality, or of a maximum neighborhood radius. Nevertheless, the neighborhoods given by the Gabriel graph are not sufficient. We now expand on a few of their properties; these will be useful in motivating the rest of the algorithm.
+
+3.3.1 Closing triangles 
+
+Here we show that the edges created using the above connectivity rule can only form acute triangles in Rn. Let three points xi, xj, xk be such that xi and xk are connected, as well as xj and xk. The rule says that xi and xj shall be connected only if xk is outside the closed ball Bij of radius R = rij/2 centered halfway between xi and xj (where rij stands for the Euclidean distance between xi and xj). Using Apollonius’s formula for the squared distance m^2 between xk and the midpoint between xi and xj, we obtain
+
+$$m^2 = \tfrac{1}{4}\left(2 r_{ik}^2 + 2 r_{jk}^2 - r_{ij}^2\right). \tag{14}$$
+
+Then, xk is in Bij if and only if m^2 ≤ R^2, so
+
+$$\begin{aligned} \tfrac{1}{4}\left(2 r_{ik}^2 + 2 r_{jk}^2 - r_{ij}^2\right) &\le R^2 = \left(\frac{r_{ij}}{2}\right)^2 \\ r_{ik}^2 + r_{jk}^2 &\le r_{ij}^2. \end{aligned} \tag{15}$$
+
+Notice that equality will hold when xi–xj–xk is a right triangle. Therefore:
+
+Remark 4. A triangle will be formed by edges in a Gabriel graph only when it is acute (see Figure 8-A).
+
+3.3.2 Maximum curvature 
+
+The above result leads to a bound on the maximum principal curvature that is allowed locally on M such that the Gabriel graph correctly approximates it (i.e., without closing a triangle). Assume xi, xj, and xk are points in a smooth manifold M as in Figure 8-B, up to the level that the sampling defines. If we assume that the curvature, κ, is locally constant, then the geodesic from xi to xj passing through xk is an arc of a circle C. Therefore, the segments ℓik and ℓkj approximate geodesics on M, but not ℓij (which would cause “folding”). Hence, values of curvature that can be correctly inferred are those that do not create an edge between xi and xj (i.e., those for which the ball Bij is non-empty). In this case, from equation 15, the maximum such curvature, κmax, occurs when xi, xk, and xj form a right triangle in space (as any larger value would cause this triangle to be acute, connecting xi to xj). Then, from Thales’s theorem, the diameter D of C would equal the length of the hypotenuse ℓij, so
+
+$$\kappa_{\max} = \frac{1}{D/2} = \frac{2}{D} = \frac{2}{\sqrt{r_{ik}^2 + r_{jk}^2}}. \tag{16}$$
+
+Figure 8: Implications of the connectivity rule in a Gabriel graph. (A) Closing triangles from edges: three points will be mutual neighbors if and only if they form an acute triangle (left). If the angle between xi and xj at xk is at least π/2, all three points will lie in Bij, so no edge is created (right). (B) The maximum principal curvature in M (shown in blue) that can be reasonably approximated by the resulting graph geodesic (path) is constrained by the sampling interval. The limiting case occurs when three points form a right triangle (top; cf. equation 16). When sampling is too sparse (bottom left), a triangle may be formed, in this case preventing the graph from adequately capturing the manifold’s geometry. As sampling frequency increases (bottom right), higher curvatures can be better approximated.
+
+A special case to consider is when M is uniformly sampled with constant interval T over arc length. Then, the arc length s between i and j is 2T; but, since rij = D, s covers half the circle and we have 2T = πD/2. Equation 16 then becomes
+
+$$\kappa_{\max}(T) = \frac{\pi}{2T}. \tag{17}$$
+
+Remark 5. Equations 16 and 17 define the maximum geodesic curvature in M that can be adequately inferred from a Gabriel graph. As a consequence, the reach is lower-bounded by 1/κmax.
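+
+As a quick numerical check that equations 16 and 17 agree, the sketch below evaluates both in the limiting right-triangle configuration. The function names and the chosen diameter are illustrative assumptions, not part of the paper.
+
```python
import math


def kappa_max_from_edges(r_ik, r_jk):
    """Equation 16: curvature bound from the two edge lengths meeting at x_k."""
    return 2.0 / math.sqrt(r_ik ** 2 + r_jk ** 2)


def kappa_max_uniform(T):
    """Equation 17: the same bound for a constant arc-length sampling interval T."""
    return math.pi / (2.0 * T)


# Limiting case on a circle of diameter D: x_i and x_j are antipodal and x_k sits
# a quarter turn from each, so both chords have length D / sqrt(2).
D = 2.0
chord = D / math.sqrt(2.0)
T = math.pi * D / 4.0  # arc length between consecutive samples, from 2T = pi * D / 2

k16 = kappa_max_from_edges(chord, chord)
k17 = kappa_max_uniform(T)
```
+
+Both expressions reduce to 2/D, the curvature of the circle through the three points.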
+
+3.3.3 Degree distribution in Gabriel graphs 
+
+We now study the above connectivity rule starting with flat, uniformly sampled manifolds (i.e., “regular grids”) to illustrate how Gabriel graphs naturally adapt to both their geometry and dimensionality. As shown in Figure 9-A, in such ideal cases the degree of an interior node in the Gabriel graph agrees with the true number of (literal) nearest neighbors, i.e.: 2 for collinear points, 4 for a square grid, and 6 for a triangular grid. Node degree appears to grow with dimension as 2^d, except for the triangular grid (which, in some sense, looks too “non-generic”). Adding noise (Gaussian, with standard deviation equal to half the spacing between neighboring points) supports this conjecture, as the degree then approaches 2^d regardless of the original grid structure. This holds in higher dimensions as well, for both normal and uniform sampling at random (Figures 9-B,C and 12).
+
+Remark 6. The expected number of neighbors in a Gabriel graph approximately follows a distribution centered at 2^d (where d is the intrinsic dimension of the data) for a variety of sampling strategies (Figure 9-C). Importantly, because Gabriel graphs are inherently scale invariant, this degree distribution is largely independent of sampling density.
+
+How to explain such remarkable regularity despite the randomness of sampling? A complementary geometric view of the Gabriel graph connectivity rule is illuminating: each edge between data points implies an “occluding hyperplane” that blocks other points from becoming neighbors (see Figure 10). For example, when d = 1, two points necessarily occlude any additional connections, and every non-boundary point must have 2 neighbors. Now, using the diagrams in Figure 11 as reference, we find that, when d = 2, on average ∼4 points are sufficient to occlude a point xi from all sides. For d = 3 this number is doubled again, and the expected number of neighbors becomes ∼8, revealing the trend. Every additional dimension adds a new coordinate axis along which the previous constraints are duplicated, roughly doubling the average number of directions available from which neighbors can connect. Once 2^d balls are “attached” to xi, the remaining space is greatly reduced, and so is the probability of drawing a sample point from inside the region H enclosed by the hyperplanes.
+
+When the neighbors are regularly spread around xi, by construction this region H is equivalent to a d-dimensional orthoplex (or cross-polytope; see footnote 2). A d-orthoplex has 2^d facets (or (d−1)-faces), and is one of the three finite, regular, convex polytopes that exist in dimension higher than 4 (the other two being hypercubes and simplices). Naturally, when sampling is not uniform, we should find irregular orthoplexes instead. While this geometric construction supports our empirical results, and implies they should hold in higher dimensions, it also suggests the following:
+
+Remark 7. Our experiments on the growth in dimension of randomly sampled points agree with a model in which Gabriel neighbors lie approximately in the facets of an orthoplex. We shall later use the additional observation that the dual polytope (a d-hypercube) of an orthoplex is obtained by placing a vertex (i.e., a neighbor) in each of its 2^d facets.
+
+The Gabriel graph enjoys many attractive properties, and provides the starting point for our algorithm. The above arguments show how the space is largely filled by “Gabriel balls” within the manifold, but such balls may also fill space across holes and bottlenecks; curvature must be dealt with. Examples were given in Figure 6, where we showed that Gabriel connections can arise incorrectly and must be removed. To do so, one must “look” in every direction (of the tangent plane), and past immediate neighbors. For this, we now develop the weighted graph counterpart to the Gabriel graph, exploiting the kernel to extend local information globally. This begins to connect the graph construction more directly to manifold properties.
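+
+The degree trend discussed in this section can be probed with a small experiment. The sketch below is an assumption-laden illustration rather than a reproduction of the paper's experiments: it samples 200 points uniformly from a unit cube (not a ball), uses a fixed seed, and averages over all points, so boundary points pull the mean below 2^d. The function name `gabriel_degrees` is hypothetical.
+
```python
import numpy as np


def gabriel_degrees(X):
    """Node degrees of the Gabriel graph on the rows of X (plain O(N^3) sketch)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    n = len(X)
    deg = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Squared distance from every point to the midpoint of x_i and x_j.
            m2 = 0.25 * (2.0 * D2[i] + 2.0 * D2[j] - D2[i, j])
            m2[[i, j]] = np.inf  # ignore the endpoints themselves
            if np.all(m2 > 0.25 * D2[i, j]):  # ball B_ij holds no third point
                deg[i] += 1
                deg[j] += 1
    return deg


rng = np.random.default_rng(0)
mean_degree = {}
for d in (1, 2, 3):
    X = rng.uniform(size=(200, d))
    mean_degree[d] = gabriel_degrees(X).mean()
    print(f"d={d}  2^d={2 ** d}  mean degree={mean_degree[d]:.2f}")
```
+
+In one dimension the graph is a path, so every interior point has exactly 2 neighbors. For d = 2 and d = 3 the printed means should be read against 2^d only loosely, since a large share of the 200 points lies near the boundary of the cube.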
+
+Footnote 2: An orthoplex is a line segment in 1-D, a square in 2-D, a regular octahedron in 3-D, a 16-cell in 4-D, etc.
+
+Figure 9: Regularity of node degree distribution in Gabriel graphs with random sampling. (A) Node degree in graphs computed from regular grids (constant sampling interval, T) and their jittered versions (Gaussian noise with std. dev. 0.5T). Top: A sequence of collinear points (left) produces a one-dimensional grid (center). Addition of noise (right) does not change the mean degree (constant 2 for interior points). Middle: A square grid (left) results in a quadrilateral mesh with constant degree 4 in its interior. Although addition of noise considerably scrambles the points, the mean degree is roughly unchanged. Bottom: Points arranged as a triangular grid (left) result in a triangular mesh where every interior node has degree 6. Its noisy version looks similar to a noisy square grid, with mean degree also approximating 4. (B) Degree distribution for interior points of d-dimensional triangular and square grids after addition of Gaussian noise. Moderate amounts of noise are sufficient to make the mean degree become approximately 2^d. Error bars indicate standard deviation; dotted lines show constant 2^n values for reference. (C) Mean degree of d-dimensional manifolds sampled using different strategies: uniformly at random, normally at random, and as jittered versions of regular triangular and square grids (as in (A), added Gaussian noise with std. dev. 0.5T). Remarkably, mean degree grows approximately as 2^n regardless of the sampling strategy.
+
+Figure 10: (A) A central point xi (in blue) and its neighbors (in black). Every neighbor xj of xi will “occlude” the entire area behind a hyperplane tangent to Bij at xj (dashed lines). That is, no point inside the occluded areas (shaded region) can form a connection with xi. Here, the dashed ball does not form a connection between xi and xk because xj lies exactly on its boundary; despite this, xk still contributes with an occluding hyperplane, preventing farther points from connecting to xi. (B) In principle, there is no limit to the number of neighbors a point in ambient space Rn may have (when n ≥ 2); e.g., any number of points lying exactly on a hypersphere around xi (dotted curve, in orange) will not occlude one another. Sets of nodes with connectivity such as this are termed “wheels” in graph theory, and the more points they contain, the less likely they are to occur in real datasets. In this example, any appreciable variability in the distance from xi to its neighbors would cause one (or several) of them to become occluded. (C) Points inside occluded areas can also contribute with additional occluding hyperplanes. Here, although xk lies inside the region occluded by xj (and therefore cannot form a connection with xi), it produces further occlusion behind a hyperplane of its own (region shaded in red). So xl cannot connect to xi, either, due to the presence of xk (even though it is not occluded by xj).
+
+Figure 11: Occlusion hyperplanes (shown in gray) due to neighbors in dimensions 1, 2, and 3 (A–C, respectively); compare with Figure 10. Every additional dimension adds a new coordinate axis along which the previous constraints are duplicated, roughly doubling the average number of directions available from which neighbors can connect. Once 2^d Gabriel balls are “attached” to xi, the remaining space is greatly reduced, and so is the probability of drawing a sample point from inside the region enclosed by the hyperplanes.
+
+Figure 12: Distribution of node degree in the Gabriel graph of datasets with different sampling strategies and dimensionalities. Top: Points sampled normally (blue) or uniformly (orange) at random from a two-dimensional ball result in similar degree distributions centered at 2^2. Bottom: In higher dimensions, interior points continue to follow this pattern. On the left, a 4-dimensional unit ball sampled uniformly at random is shown projected onto R3, with boundary points labeled as those with vector norm > 0.9 (edges omitted for clarity). It produces a Gabriel graph where interior points have degree distribution centered at ∼2^4, and the mean degree of boundary points is close to 2^3.
+
+## 3.4 Multiscale optimization
+
+We now begin to develop the iteration in Algorithm 1, given the initial Gabriel neighborhood graph, $G^{(0)}$. Assuming (temporarily) that this gives correct local neighborhoods, what should the corresponding scales be for a Gaussian kernel? In effect this is an extension of the unweighted graph $G$ into a weighted counterpart, $\mathcal{G}$. From Figure 5, this weighted graph is also a type of approximation of (aspects of) the continuous manifold. Because density is not necessarily uniform, different points might have different neighborhood radii, so a multiscale Gaussian similarity kernel (equation 13) is used. Each point $x_i$ has its own associated scale, $\sigma_i$. To develop the computation of such scales, we now move into the continuous domain and exploit the geometric notion of a cover.
+
+3.4.1 Covering criterion 
+
+A criterion for separability between two Gaussians has been developed in the mixture-of-Gaussians literature [33, 103, 6]: two spherical Gaussians, i and j, can be distinguished (in the sense of solving a classification problem) with reasonable probability when they have a separation of at least 
+
+$$\|\mu_i - \mu_j\| > C \max\{\sigma_i, \sigma_j\}, \tag{18}$$
+
+at which the overlap in their probability mass is a constant fraction [103]. We flip this around by using a different, but related, construction: consider Gaussians now centered at the midpoints (i.e., not on data points) to indicate whether nearby points should be connected, not separated (Figure 5 illustrates this construction directly). Furthermore, because we use a multiscale kernel (equation 13), the (non-normalized) Gaussian density becomes a function of $\sqrt{\sigma_i\sigma_j}$. Hence, we obtain a criterion for what we term C-connectivity:
+
+Definition: Two neighbors $i$ and $j$ in the discrete graph $G = (E, V)$ are $C$-connected by the multiscale kernel when the geometric mean of their individual scales is at least the distance between $x_i$ and $x_j$ scaled by a positive constant, $C$:
+
+$$C\|x_i - x_j\| \le \sqrt{\sigma_i\sigma_j}. \tag{19}$$
+
+The constant $C$ plays a role in normalizing for unknown density; it will be developed in section 3.5.2. For now, we illustrate its role in the connection from graphs to manifolds. Figure 13 shows the graph over a set of data points, and the local scales obtained (by the algorithm below) for different values of $C$. Choosing $C$ too large yields scales (and therefore Gaussians) that are too large, that is, their overlap has peaks. Choosing it too small yields scales that introduce holes. Choosing it correctly, the Gaussians form a covering of the manifold that approximates a partition of unity. Such partitions of unity are used in differential geometry to extend local information (in our case, the scales) to global information (a covering of the manifold). By choosing appropriate scales, i.e., scales that meet our criterion for all edges in $E$, we also ensure a covering of the edges, in the following sense: the value of the multiscale kernel $K_{ij}$ between $x_i$ and $x_j$ is identical to that of a kernel re-centered at the midpoint $p \equiv (x_i + x_j)/2$ and re-scaled using half the geometric mean of $\sigma_i$ and $\sigma_j$ as its scale, $\sigma_p$:
+
+$$K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_i\sigma_j}\right) = \exp\left(-\frac{\|(x_i - x_j)/2\|^2}{\sigma_i\sigma_j/2^2}\right) = \exp\left(-\frac{\|p - x_i\|^2}{\sigma_p^2}\right), \tag{20}$$
+
+with $\sigma_p \equiv \sqrt{\sigma_i\sigma_j}/2$ (Figure 14).
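+
+To make the covering criterion concrete, the following worked example checks equations 19 and 20 numerically. The values $r_{ij} = \|x_i - x_j\| = 2$, $C = 0.9$, $\sigma_i = 1.5$, and $\sigma_j = 2.4$ are illustrative assumptions, not values taken from the paper.
+
+$$
+\begin{aligned}
+\sqrt{\sigma_i\sigma_j} &= \sqrt{1.5 \times 2.4} = \sqrt{3.6} \approx 1.897 \;\ge\; C\,r_{ij} = 0.9 \times 2 = 1.8, \\
+K_{ij} &= \exp\left(-\frac{r_{ij}^2}{\sigma_i\sigma_j}\right) = \exp\left(-\frac{4}{3.6}\right) \approx 0.329, \\
+\sigma_p &= \frac{\sqrt{3.6}}{2} \approx 0.949, \qquad \exp\left(-\frac{\|p - x_i\|^2}{\sigma_p^2}\right) = \exp\left(-\frac{1^2}{0.9}\right) \approx 0.329.
+\end{aligned}
+$$
+
+The first line shows that the edge is $C$-connected; the second and third show that the multiscale kernel and the midpoint kernel agree, as equation 20 requires. If $\sigma_j$ were reduced to $2.0$, then $\sqrt{\sigma_i\sigma_j} = \sqrt{3.0} \approx 1.732 < 1.8$ and the edge would no longer be $C$-connected.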
+
+[Figure 13 image: one row each for $C = 1$, $C = 0.9$, and $C = 0.8$; color scale “Summation of Gaussians”, from 0.0 to 2.0.]
+
+Figure 13: Effect of hyperparameter $C$ from equation 19 on the resulting weighted graph (left), optimal scales (middle), and manifold approximation (right, shown as the resulting summation over the Gaussian kernels around each point using their individual scales). For $C = 1$ (top), the scales overlap too much and, as a result, the Gaussian summation (right) is highly non-uniform. For $C = 0.8$ (bottom), the scales are not sufficiently large to properly cover the underlying manifold, resulting in holes (right). When $C = 0.9$, there is a good compromise between covering and keeping a uniform density, so the Gaussian summation approximates a partition of unity (summing to $\sim 1$ everywhere) when the scales correctly conform to the local sampling characteristics. Our approach will allow us to tune $C$ based on a relaxation statistic, $\delta'_i$.
+
+Remark 8. We say a $C$-covering is attained when every pair $(i, j) \in E$ is $C$-connected (equation 19). Additionally, when the spacing between neighboring points is approximately uniform locally, the pointwise summation over all Gaussian kernel bumps given by the individual scales provides an (un-normalized) partition of unity of $\mathcal{M}$.
+
+We now use the covering constraints to solve for the set of scales, $\sigma$. It is desirable that the scales be small (respecting the reach), while at the same time maintaining the connectivity in the weighted graph $\mathcal{G}$ close to that of the unweighted graph $G$. Thus, one idea is to find scales such that the sum of edge weights in $\mathcal{G}$ incident to a node $i$ from its neighbors in $G$ approximates the degree of $i$ in $G$, for all $i$, while at the same time ensuring a $C$-covering. This, however, amounts to a non-convex problem in which the cost function involves a summation of multiscale kernel values. We are unable to solve this efficiently.
+
+[Figure 14 image: left, nodes $i$ and $j$ at distance $r_{ij}$ with their scale circles; center, feasible $(\sigma_i, \sigma_j)$ region above the hyperbola $\sigma_i\sigma_j = r_{ij}^2$, with pairs A, B, and C marked; right, kernel value (0 to 1) versus position, for scales $\sqrt{\sigma_i\sigma_j}$ and $\tfrac{1}{2}\sqrt{\sigma_i\sigma_j}$.]
+
+Figure 14: Covering constraint for the multiscale kernel of equation 13. Left: A graph $G$ with two nodes $i$ and $j$ at a distance $r_{ij}$ from each other in $\mathbb{R}^n$. Since they are connected, their assigned individual scales $\sigma_i$ and $\sigma_j$ must satisfy $\sigma_i\sigma_j \ge r_{ij}^2$, i.e., the covering constraint (here we assume $C = 1$). Center: All feasible pairs $(\sigma_i, \sigma_j)$ lie inside the region above a positive hyperbola, three of which are indicated as colored points; pairs A and B satisfy the constraint exactly, while C satisfies it in excess. Each one is also depicted as a pair of circles on the left plot using the same color code, each one centered at its corresponding node (radii are set to half the scale, for clarity). Although pairs A and B differ in their ratio $\sigma_i/\sigma_j$, both result in the same multiscale kernel value for the edge $(i, j)$, since the product $\sigma_i\sigma_j$ is the same; pair C yields a slightly higher value. This illustrates the freedom that might exist in choosing an optimal combination of scales for all nodes (i.e., a covering). Right: Multiscale kernel values, $K_{ij}$, centered at either $i$ or $j$, shown in green, are symmetric (with scale $\sqrt{\sigma_i\sigma_j}$). Horizontal axis represents position over the line in $\mathbb{R}^n$ passing through $i$ and $j$. A kernel centered at the midpoint $p$ between $i$ and $j$ using half the scale (black curve) attains the same value as $K_{ij}$ at $i$ and $j$. Dashed red line indicates the common value between the three kernels.
+
+Instead, we find the smallest individual scales such that our covering criterion is satisfied for all edges (a “minimal covering”), and later address the quality of the relaxation by using a statistical pruning (edge sparsification). This can be transformed into a convex, linear program with linear constraints by which all scales can be solved for simultaneously, as we show next. (We also present, in Appendix A, a greedy approach to this optimization that may be convenient when dealing with very large datasets.)
+
+3.4.2 Linear program relaxation 
+
+To achieve a minimal covering, one might minimize $\sum_i \sigma_i$ (or, equivalently, the 1-norm of the vector $\sigma$, since scales are positive) subject to the covering constraint (see footnote 3). This suggests the following:
+
+> Footnote 3: Another possibility is to use a weighted sum $\sum_i \nu_i\sigma_i$ while keeping the same constraints, thus still guaranteeing a covering. The weights $\nu_i$ add a bias to how the length of an edge is split between its two incident nodes (by balancing their individual scales). One interesting option is to set $\nu_i = r_i^{\mathrm{non}} / r_i^{\mathrm{FN}}$, i.e., the ratio between the distance to the nearest non-neighboring point, $r_i^{\mathrm{non}}$, and the distance to the farthest neighbor, $r_i^{\mathrm{FN}}$.
+
+Optimization Problem:
+
+$$
+\begin{aligned}
+\min_{\sigma}\quad & \mathbf{1}^\top \sigma \\
+\text{s.t.}\quad & (i, j) \text{ is } C\text{-connected}, \quad \forall (i, j) \in E, \\
+& \sigma_i \text{ is bounded}, \quad \forall i \in V,
+\end{aligned}
+\tag{21}
+$$
+
+where $\sigma$ is the vector of individual scales, $\sigma_i$, and $\mathbf{1}$ is the all-ones vector. Now it remains to represent the $C$-covering requirement by a set of constraints. Looking in detail at $C$-connectedness (equation 19) as a function of $\sigma_i$ and $\sigma_j$, observe that it represents a region delimited by a single-branched hyperbola (since the distance and scales are positive):
+
+$$\sigma_i\sigma_j \ge (C r_{ij})^2, \quad \sigma_i > 0, \quad \sigma_j > 0, \tag{22}$$
+
+where $r_{ij} \equiv \|x_i - x_j\|_{\mathbb{R}^n}$. Each $\sigma_i$ is naturally bounded above by the distance to $i$’s farthest neighbor, $r_i^{\mathrm{FN}}$:
+
+$$\sigma_i \le r_i^{\mathrm{FN}}, \tag{23}$$
+
+beyond which all neighbors are satisfied (see footnote 4), so further increasing either scale would make the weights to non-neighbors larger than strictly necessary (thereby hurting the kernel graph relaxation). These bounds, combined, specify a bounding box for each edge that must necessarily be crossed (or at least touched) by the hyperbola, since $r_{ij} > 0$.
+
+[Figure 15 image: four panels (A to D) in the $(\sigma_i, \sigma_j)$ plane, each showing the upper bounds $u_i$ and $u_j$ and the hyperbola vertex $v$; legend: original constraint, bounds, convexified feasible region.]
+
+Figure 15: Examples of constraints introduced by an edge, $e_{ij}$, in $G$. The $C$-connectivity rule, i.e., the hyperbola given by $\sigma_i\sigma_j = (C\|x_i - x_j\|)^2$ (dashed curve), when convexified, may give rise to one or two linear constraints, depending on whether the hyperbola’s vertex $v$ (point where $\sigma_i = \sigma_j$) intersects the bounding box given by the lines $\sigma_i = 0$, $\sigma_j = 0$, $\sigma_i = u_i$, and $\sigma_j = u_j$, where $u_i$ and $u_j$ denote upper bounds. Hatched area (in orange) shows feasible region using convexified constraints; tangent line at $v$ is shown in gray. When $v$ is interior to the bounding box (A), two secants (in blue) define the feasible region (namely, the lines passing through $v$ and the points where the hyperbola intersects the lines $\sigma_i = u_i$ and $\sigma_j = u_j$); when either $v = u_i$ (B) or $v = u_j$ (C), only one secant is necessary; when $v$ coincides with both $u_i$ and $u_j$ (D) (which may occur if $C$ is set to 1), again only one inequality is necessary, namely the tangent line at $v$.
+
+> Footnote 4: That is assuming $C \le 1$ (a natural choice). If for some reason one needs to allow $C > 1$, then the upper bounds must be scaled by $C$ in order to ensure feasibility.
+
+Due to the hyperbolae, this amounts to a non-linear, non-convex set of constraints. However, we can convexify the feasible set by considering, for each edge $(i, j)$, the line(s) passing through the hyperbola’s vertex (the point at which $\sigma_i = \sigma_j = C r_{ij}$) and the points where the hyperbola intersects the bounding box. The four possibilities are shown in Figure 15. The feasible region for each edge, therefore, is bounded by a convex envelope given by such line(s) and those defined by the upper bounds to $\sigma_i$ and to $\sigma_j$. Such envelopes for all edges, combined, define the boundaries of a convex polytope. Note that this convexification is conservative in the sense that only the objective is relaxed: the feasible scales are always at least as large as required by the original non-convex problem, therefore our covering requirement is not relaxed. (Because of the presence of a later pruning stage in the algorithm, it is better to over-connect here than to inadvertently disconnect nodes that should otherwise be connected.)
+
+Letting $m \le 2|E|$ be the total number of linear constraints obtained as above, and $N$ the number of nodes in $G$, we define the $m \times N$ matrix $\Lambda$ and the $m \times 1$ vector $b$. Now, for each edge, $e_{ij}$, let its two possible constraints be expressed as
+
+$$\sigma_j \ge \alpha_{ij}^{(1)} \sigma_i + \beta_{ij}^{(1)} \tag{24}$$
+
+$$\sigma_j \ge \alpha_{ij}^{(2)} \sigma_i + \beta_{ij}^{(2)} \tag{25}$$
+
+with $\alpha_{ij}$ and $\beta_{ij}$ denoting, respectively, the slope and intercept of the corresponding line(s) forming its convex envelope. Rearranging, we obtain $\alpha_{ij}\sigma_i - \sigma_j \le -\beta_{ij}$ for each line, which is encoded as a row in $\Lambda$ with values $\alpha_{ij}$ and $-1$ at columns $i$ and $j$, respectively (with zeros everywhere else), and an entry in $b$ with value $-\beta_{ij}$:
+
+$$
+\underbrace{\begin{bmatrix}
+ & \vdots & & \vdots & \\
+0 \;\cdots & \alpha_{ij}^{(1)} & \cdots\; 0 \;\cdots & -1 & \cdots\; 0 \\
+0 \;\cdots & \alpha_{ij}^{(2)} & \cdots\; 0 \;\cdots & -1 & \cdots\; 0 \\
+ & \vdots & & \vdots &
+\end{bmatrix}}_{\Lambda \;(m \times N)}
+\;\underbrace{\sigma}_{N \times 1}
+\;\le\;
+\underbrace{\begin{bmatrix} \vdots \\ -\beta_{ij}^{(1)} \\ -\beta_{ij}^{(2)} \\ \vdots \end{bmatrix}}_{b \;(m \times 1)}.
+$$
+
+The two rows shown correspond to the constraints $e_{ij}^{(1)}$ and $e_{ij}^{(2)}$, with the $\alpha$ entries in column $i$ and the $-1$ entries in column $j$.
+
+Remark 9. The convex envelope defining the constraints can be expressed by the linear inequalities:
+
+$$\Lambda\sigma \le b, \qquad 0 < \sigma \le r^{\mathrm{FN}}, \tag{26}$$
+
+where $r^{\mathrm{FN}}$ is the vector of distances to each node’s farthest neighbor. Hence the problem now amounts to a convex, linear program (LP) with linear constraints:
+
+Optimization Problem: LP Relaxation:
+
+$$
+\begin{aligned}
+\min_{\sigma}\quad & \mathbf{1}^\top \sigma \\
+\text{s.t.}\quad & \Lambda\sigma \le b, \\
+& 0 < \sigma \le r^{\mathrm{FN}},
+\end{aligned}
+\tag{27}
+$$
+
+which can be readily solved by a variety of methods [see, e.g., 21]. Figures 19, 20, and 21 show the results of running this optimization on different examples.
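+
+As an illustration of how a single edge contributes rows to $\Lambda$ and $b$ (equations 24 to 27), consider one edge with $C r_{ij} = 1$ and upper bounds $u_i = 2$, $u_j = 4$. These numbers are hypothetical, chosen so that the hyperbola’s vertex lies inside the bounding box (case A of Figure 15).
+
+$$
+\begin{aligned}
+&\text{vertex: } v = (1, 1); \qquad \text{box intersections of } \sigma_i\sigma_j = 1\text{: } (2,\ 0.5) \text{ and } (0.25,\ 4); \\
+&\text{secant 1: } \alpha_{ij}^{(1)} = \frac{0.5 - 1}{2 - 1} = -0.5, \qquad \beta_{ij}^{(1)} = 1 - (-0.5)(1) = 1.5; \\
+&\text{secant 2: } \alpha_{ij}^{(2)} = \frac{4 - 1}{0.25 - 1} = -4, \qquad \beta_{ij}^{(2)} = 1 - (-4)(1) = 5; \\
+&\text{rows of } \Lambda\sigma \le b\text{: } \quad -0.5\,\sigma_i - \sigma_j \le -1.5, \qquad -4\,\sigma_i - \sigma_j \le -5.
+\end{aligned}
+$$
+
+For this edge in isolation, minimizing $\sigma_i + \sigma_j$ over the convexified region gives $\sigma_i = \sigma_j = 1$ (objective value 2, compared with 2.5 at $(2, 0.5)$ and 4.25 at $(0.25, 4)$). This point lies on the hyperbola, so the covering constraint of equation 22 holds with equality. Because the hyperbola is convex, each secant lies on or above it between its endpoints, which is why the linearized feasible set never admits scales smaller than the original constraint allows.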
+
+## 3.5 Sparsification 
+
+Summarizing what we have seen so far, the Gabriel graph provides an initial estimate of connectivity, while the LP optimization provides minimal scales for a continuous kernel to cover those connections. However, since the initial estimate of the discrete graph might contain incorrect connections, its resulting optimal scales might also be inadequate. An example of this can be seen in Figure 19: initially, two pairs of nodes are connected across the central gap since a Gabriel ball exists between them. This will require very large scales to “cover” these edges. Furthermore, the Gabriel graph is based on a local connectivity rule; however, as illustrated in Figure 16, decisions about connecting nodes across a gap should not be local. We here address both of these issues, by introducing a global statistic based on how frequently such a gap occurs in the data. In terms of Algorithm 1, we are now at steps 6 and 7.
+
+[Figure 16 image: inset histogram of the volume ratio statistic $\delta'_i$ (horizontal axis, 0.5 to 3.5) versus counts, with the median and the threshold marked.]
+
+Figure 16: Local vs. global assessment of neighborhoods. Left: The points inside the cropped window appear to form two well-defined clusters when looked at up close (local estimation). However, when considered in the context of the full dataset (global estimation), the apparent gap between the top and bottom groups “disappears,” i.e., it is well within the range of gaps observed throughout the data. More precisely, it does not significantly deviate from the average sampling interval. Right: The converged graph $G$ indeed connects the two groups by edges, and the distribution of volume ratios, $\delta'_i$ (lower inset), confirms that all edges are reasonable.
+
+3.5.1 Volume ratio 
+
+Because incorrect connections can be given by Gabriel balls lying in the free space between parts of a manifold, i.e., across the medial axis, it is tempting to simply prune the longest connections. Note, however, that the size of a scale by itself is not necessarily important: in both examples shown in Figure 6, the non-uniform density causes scale sizes to vary considerably, and even the largest ones are appropriate, that is, are still consistent with the distances to neighboring points.
+
+Conversely (and importantly), a scale that is excessively large will likely cover “too many” points. That is, it will cover neighbors in excess of the number of discrete neighbors of its corresponding node in $G$. We quantify this notion by observing that an individual scale, $\sigma_i$, should produce kernel values whose sum is comparable to the discrete degree, $\deg(i)$, of node $i$ in $G$. As will be shown, after proper normalization this also means $\sigma_i$ shall relate to a local volume element around $x_i$, or the inverse of the local density. Since each connection in $G$ can be seen as having unit weight, a Gaussian kernel around $x_i$ with scale $\sigma_i$ should distribute that same amount, $\deg(i)$, only continuously over ambient space. We start our derivation with a definition:
+
+Definition: Let $w_{ij}^{(\sigma_i)}$ be the Gaussian kernel value between $x_i$ and $x_j$ using scale $\sigma_i^{(t)}$ at iteration $t$. A (non-isolated) node’s volume ratio at iteration $t$, denoted by $\delta_i^{(t)}$, is defined as
+
+$$\delta_i^{(t)} \equiv \frac{\sum_{j \in V} w_{ij}^{(\sigma_i^{(t)})}}{\sum_{j \in V} a_{ij}}, \tag{28}$$
+
+i.e., the ratio between node $i$’s weighted degree due to $\sigma_i^{(t)}$ and its discrete degree in $G^{(t)}$ (henceforth we suppress the iteration dependency $(t)$ to simplify notation). An individual-scale Gaussian kernel is needed to correctly assess the impact of $\sigma_i$ on the relaxation from the perspective of $i$ alone: the multiscale kernel here might artificially increase the weighted degree of $i$ when other nodes (even non-neighbors of $i$!) have incorrect scales. (Nevertheless, as discussed below, a corresponding ratio using the actual weights in $\mathcal{G}$ may eventually be used for convergence purposes.)
+
+Now, using a mean-value integral [as in 30], the numerator approximates the volume under the continuous Gaussian kernel over $\mathcal{M}$, and can be further approximated by
+
+$$\sum_j w_{ij}^{(\sigma_i)} \approx \frac{N}{\mathrm{vol}(\mathcal{M})} \int_{\mathcal{M}} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_i^2}\right) dx_j \tag{29}$$
+
+when $\mathcal{M}$ has uniform density and low curvature. In practice, the kernel will have compact support due to numerical precision (i.e., its values become effectively zero for sufficiently large distances), so by defining the volume element $dV_i \equiv \mathrm{vol}(\mathcal{N}(x_i)) / |\mathcal{N}(x_i)|$ of a neighborhood $\mathcal{N}(x_i) \subset \mathcal{M}$ around $x_i$, we may rewrite equation 29 as
+
+$$\sum_j w_{ij}^{(\sigma_i)}\, dV_i \approx \int_{\mathcal{M}} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_i^2}\right) dx_j \tag{30}$$
+
+when the sampling is approximately uniform around $x_i$. By further assuming that $\sigma_i$ is small, and that $\mathcal{M}$ can be well-approximated locally by its tangent space $\mathbb{R}^d$, then
+
+$$\int_{\mathcal{M}} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_i^2}\right) dx_j \approx \int_{\mathbb{R}^d} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_i^2}\right) dx_j = (\sqrt{\pi}\,\sigma_i)^d, \tag{31}$$
+
+so
+
+$$\sum_j w_{ij}^{(\sigma_i)}\, dV_i \approx (\sqrt{\pi}\,\sigma_i)^d, \tag{32}$$
+
+as shown in Figure 17.
+
+An analogous derivation for the discrete degree summation is as follows. First, note that the edge weight in this case is a constant (unity); it remains to determine its support over $\mathcal{M}$. From section 3.3.3, we know that, for simple manifolds with random sampling, the node degree $\deg(i)$
+
+in a Gabriel graph is approximately $2^{d_i}$ within a region of constant intrinsic dimensionality, where $d_i$ denotes the local intrinsic dimension around $x_i$ (possibly different around other points in $X$; see footnote 5). In more general manifolds, we expect the converged graph $G^\star$ instead to approach such a property. This means $\sum_j a_{ij} \approx 2^{d_i}$ will approximate the volume of a hyperrectangle (or box) of unit height and having a $d_i$-dimensional hypercube of side 2 as its base (see footnote 6). So, by defining $\rho_i$ as the radius of the local volume element $dV_i$ (such that $\rho_i = \sqrt[d_i]{dV_i}$), we may write:
+
+$$\sum_j a_{ij}\, dV_i \approx \int_{-\rho_i}^{\rho_i} \cdots \int_{-\rho_i}^{\rho_i} 1 \; dx_{j_1} \ldots dx_{j_d} = (2\rho_i)^{d_i}, \tag{33}$$
+
+as illustrated in Figure 17. Hence, $\rho_i$ is a kind of “neighborhood radius” of $x_i$. From equations 32–33, equation 28 becomes
+
+$$\frac{\sum_j w_{ij}^{(\sigma_i)}}{\sum_j a_{ij}} = \frac{\sum_j w_{ij}^{(\sigma_i)}\, dV_i}{\sum_j a_{ij}\, dV_i} \approx \left(\frac{\sqrt{\pi}\,\sigma_i}{2\rho_i}\right)^{d_i}, \tag{34}$$
+
+representing the ratio between the volume of a Gaussian with scale $\sigma_i$ and that of a box of side $2\rho_i$ and height 1 (cf. Figure 17). As the algorithm approaches convergence, we expect $\sigma_i \approx \rho_i$ (scales are compatible with neighborhood radius) and $\deg(i)$ should approach, on average, the empirically-observed value of $2^{d_i}$ (meaning that the number of neighbors in $G$ is compatible with the dimensionality of $\mathcal{M}$). This results in
+
+$$\frac{\sum_j w_{ij}^{(\sigma_i)}}{\sum_j a_{ij}} \approx \left(\frac{\sqrt{\pi}}{2}\right)^{d_i}. \tag{35}$$
+
+Finally, we can estimate $d_i$ as
+
+$$\tilde{d}_i \equiv \log_2 \sum_j a_{ij}, \tag{36}$$
+
+based on the empirical degree distribution of $G^{(t)}$. From this, we can compute a normalized volume ratio, $\delta_i'^{(t)}$, dividing $\delta_i^{(t)}$ by the value from equation 35:
+
+Definition: A node’s normalized volume ratio is computed as
+
+$$\delta_i'^{(t)} \equiv \frac{\sum_j w_{ij}^{(\sigma_i)}}{\sum_j a_{ij}} \left(\frac{2}{\sqrt{\pi}}\right)^{\tilde{d}_i}. \tag{37}$$
+
+Nodes whose degree deviates from exactly $2^{d_i}$ will, likewise, under- or overestimate the local dimension, so reasonable volume estimates are still obtained regardless. However, in order to avoid dimension less than 1 for connected nodes, in practice when $\deg(i) = 1$ we replace $\sum_j a_{ij}$ with $\max\{2, \sum_j a_{ij}\}$.
+
+> Footnote 5: We abuse notation, therefore, when we say “$d$-dimensional manifold”, or “$\mathcal{M} \in \mathbb{R}^d$”.
+> Footnote 6: This agrees with our observation (section 3.3.3) that the unoccluded region around $x_i$ is similar to a $d_i$-orthoplex: by placing a vertex (i.e., a neighbor) in each of its $2^{d_i}$ facets, we obtain a $d_i$-hypercube, which is the dual polytope of an orthoplex.
+
+Thus, we expect $\delta'_i \approx 1$ for points obeying $\sigma_i \approx \rho_i$ and $\tilde{d}_i \approx d_i$. Crucially, points for which these conditions are not met (those having “wrong” neighbors in the original Gabriel graph, $G^{(0)}$) will depart from this by having $\delta'_i \gg 1$. In the next section, we shall use this fact to guide a sparsification of edges in $G^{(0)}$ based on $\delta'_i$.
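+
+A small numerical check of equations 36 and 37 may help fix ideas. The degree and kernel sums below are illustrative assumptions, not values from the paper. Take a node with $\sum_j a_{ij} = 4$ neighbors:
+
+$$
+\begin{aligned}
+\tilde{d}_i &= \log_2 4 = 2, \qquad \left(\frac{2}{\sqrt{\pi}}\right)^{2} = \frac{4}{\pi} \approx 1.273, \\
+\text{if } \sum_j w_{ij}^{(\sigma_i)} = \pi \approx 3.142\text{:} \quad \delta_i &= \frac{\pi}{4} \approx 0.785, \qquad \delta'_i = \frac{\pi}{4}\cdot\frac{4}{\pi} = 1, \\
+\text{if } \sum_j w_{ij}^{(\sigma_i)} = 9\text{:} \quad \delta_i &= \frac{9}{4} = 2.25, \qquad \delta'_i = \frac{9}{\pi} \approx 2.865.
+\end{aligned}
+$$
+
+The first case matches the expected value $(\sqrt{\pi}/2)^{2}$ of equation 35 and normalizes to 1; the second, where the scale covers far more kernel mass than the four discrete neighbors justify, gives a ratio well above 1. For a node with a single neighbor, the $\max\{2, \cdot\}$ rule gives $\tilde{d}_i = \log_2 2 = 1$, so the normalizing factor is $2/\sqrt{\pi} \approx 1.128$.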
+
+Figure 17: Computing the volume ratio between continuous and discrete degrees of a node $i$ with neighboring points sampled uniformly over a $d$-dimensional manifold $\mathcal{M}$. Top row: Using a Gaussian kernel, the weighted degree of $i$ (sum of kernel values $\sum_j w_{ij}^{(\sigma_i)}$) in $\mathcal{G}$ approximates the volume of a Gaussian with scale $\sigma_i$ (equation 32). Bottom row: The number of edges adjacent to $i$ in $G$ (sum of unit weights) approximates the volume of a box with unit height and a hypercube of side $2\rho_i$ as its base, where $\rho_i$ is the radius of a local volume element of $\mathcal{M}$ around $x_i$ (equation 33). Right: When the scale $\sigma_i$ is compatible with $\rho_i$, the volume ratio, $\delta_i$, is expected to be approximately $(\sqrt{\pi}/2)^d$, and therefore is a scale-invariant quantity.
+
+Interestingly, $\delta'_i$ can also be interpreted as measuring how well the scale $\sigma_i$ fits the local volume element $dV_i$ (or, equivalently, how it counteracts the local sampling density, $1/dV_i$). Since $dV_i = \rho_i^{d_i}$ (from the definition of $\rho_i$), we may rewrite equation 34 as:
+
+$$\frac{\sum_j w_{ij}^{(\sigma_i)}}{\sum_j a_{ij}} \approx \frac{(\sqrt{\pi}\,\sigma_i)^{d_i}}{2^{d_i}\, dV_i}. \tag{38}$$
+
+Summarizing the above, when $\tilde{d}_i \approx d_i$ and $\sigma_i \approx \rho_i$ we have:
+
+Remark 10. A node’s normalized volume ratio may alternatively be expressed as
+
+$$\delta_i'^{(t)} \equiv \frac{\sum_j w_{ij}^{(\sigma_i)}}{\sum_j a_{ij}} \left(\frac{2}{\sqrt{\pi}}\right)^{\tilde{d}_i} \approx \frac{(\sqrt{\pi}\,\sigma_i)^{d_i}}{2^{d_i}\, dV_i} \left(\frac{2}{\sqrt{\pi}}\right)^{\tilde{d}_i} \approx \frac{\sigma_i^{d_i}}{dV_i}. \tag{39}$$
+
+Therefore, $\delta'_i$ can be thought of as the product between kernel scale and local density. When $\sigma_i$ is optimal, it should be approximately equal to the inverse of the local density, so $\delta'_i \approx 1$.
+
+3.5.2 Uniformity of sampling and edge pruning
+
+Since $\delta_i'^{(t)}$ is evaluated for every node $x_i$, we can collect it across nodes and view it as a statistic. This has two consequences: (i) it can be used to enforce consistency in sampling, and (ii) outliers in this statistic are likely candidates for edge pruning.
+
+We address consistency of sampling first. We have several times stated that sampling is required to be locally uniform, although its rate may change over the manifold. Examples of this were shown in, e.g., Figure 12, where the sampling was denser in the center of the Gaussian distribution than in the periphery. This example differs from the regular grids, in which all nearest neighbors had exactly the same distance. Putting this together, we have:
+
+Remark 11. Locally Uniform Sampling: Let node $i$ have $k_i$ neighbors in $G^{(t)}$. Among these, let $r_i^{\mathrm{FN}}$ denote the distance from $x_i$ to its farthest neighbor, and $r_i^{\mathrm{NN}}$ that to its nearest neighbor. When $r_i^{\mathrm{FN}} \approx r_i^{\mathrm{NN}}$ for all $i$, we say the sampling is locally uniform.
+
+This is useful because a departure from the assumption that sampling is locally uniform will cause $\delta'_i$ to be on average greater than 1 throughout the dataset. To see this, when sampling is not uniform, we have $r_i^{\mathrm{FN}} > r_i^{\mathrm{NN}}$. Now, since $\sigma_i$ is optimized to cover all of $i$’s neighbors, it will have in most cases the same order of magnitude as $r_i^{\mathrm{FN}}$ (minus some possible slack due to the multiscale interaction). Therefore, the higher the variability in the neighbors’ distances, the larger the difference between $r_i^{\mathrm{FN}}$ and $r_i^{\mathrm{NN}}$ will be, making $\sigma_i$, in turn, larger than the distance to most neighbors of $i$. Ultimately, this will increase $\sum_j w_{ij}^{(\sigma_i)}$ beyond what we would have in a uniform-sampling scenario (in which $r_i^{\mathrm{FN}} \approx r_i^{\mathrm{NN}}$).
+
+When data are acquired using a global sampling strategy, this variability in the neighbors’ distances should be roughly constant throughout the dataset (rather than the distances). So we use the scalar parameter, $C$, from equation 19 to correct for this “bias” and bring the median of the distribution of $\delta_i'^{(t)}$ (denoted as $\langle\delta_i'^{(t)}\rangle$) close to 1 (see footnote 7).
+
+Remark 12. Let the tuned $C^{\star(t)}$ be that which causes $\langle\delta_i'^{(t)}\rangle$ to be closest to 1. Typically, $C^{\star(t)} < 1$, which, in the scale optimization procedure, means that the covering constraints (equation 22) are being relaxed using the distribution of $\delta'_i$ as a guide (Figure 18). Note that, although the tuning of $C$ is not necessary for finding candidates for sparsification, it attributes a quantitative meaning to the value of $\delta'_i$, so any $\delta'_i \gg 1$ is guaranteed to indicate the need for edge pruning. Such tuning should be performed at $t = 0$, and repeated as needed over the iterations whenever $\langle\delta_i'^{(t)}\rangle$ deviates too much from unity (which may happen after several edges have been pruned). Most commonly, we find $0.5 < C^{\star(t)} < 1$.
+
+Thus, we have a data-driven way of finding an appropriate value for $C$. Because it is a global constant applied to all connection constraints, it shifts the distribution of $\delta'_i$ to have median around 1 without changing its general shape. This leads us to the second use of our statistic: any node whose normalized volume ratio is much greater than the median of the population should be identified as an outlier. Such nodes will have a neighbor considerably farther than their other neighbors (relative to the median variability of such neighboring distances throughout the data), and are candidates for the sparsification step.
+
+> Footnote 7: Although the mean typically gives smoother tuning curves, the median is more robust. This matters because of the possible outlying $\delta'_i$ values.
+[Figure 18 image: left, converged unweighted graph $G^\star$; center, tuning curve of $\langle\delta'_i\rangle$ versus $C$ (0.70 to 1.00) with $C^\star = 0.915$, above histograms of $\delta'_i$ versus node count for $C = 0.800$, $C = 0.915$, and $C = 1.00$ with the median marked; right, weighted graph $\mathcal{G}^\star$.]
+
+Figure 18: Tuning the hyperparameter $C$ based on the median of the distribution of normalized volume ratios, $\langle\delta'_i\rangle$. Left: Converged unweighted graph $G^\star$ obtained for the dataset from Figure 19. Center: After computing $\langle\delta'_i\rangle$ for a range of values of $C \le 1$, the optimal $C^\star$ is that resulting in a $\langle\delta'_i\rangle$ closest to 1. Histograms below show the distribution of $\delta'_i$ for different values of $C$, including $C^\star = 0.915$. Right: The resulting weighted graph $\mathcal{G}^\star$ after $C$-tuning typically exhibits a more uniform connectivity throughout (see Figure 13).
+
+Remark 13. Nodes that are robust outliers according to the $\delta'_i$ statistic have an overly distant neighbor (relative to the other neighbors for that node) and hence are likely to be in violation of reach or other geometric constraints. These relatively distant neighbors are candidates for having an edge pruned.
+
+Given the distribution of normalized volume ratios, statistical models can be used to define a threshold for identifying outliers (see Figures 19–22). It is likely that datasets with a large number of problematic connections will exhibit a distribution with a heavy tail, or that looks like a mixture of two distributions (cf. example in Figure 22), so using the distribution’s quartiles may give a more robust result. One option that seems to work particularly well is to use estimates of the sample mean and standard deviation from the quartiles, as in [105] (throughout, we make use of the C3 method derived therein, setting the $\delta'_i$ threshold to 4.5 standard deviations above the mean thus estimated). Still, we found that results are typically quite invariant to this particular choice, especially in real-life datasets. Finally, we note that our algorithm can be run interactively, so the user can analyze the histogram of the distribution after each iteration to judge whether the choice of threshold is reasonable and thus be confident in the results.
+
+Nodes with $\delta'_i$ above the threshold should have their connection to their farthest neighbor deleted. Ideally, only one such connection is pruned after each iteration; however, should that become impractical with large datasets, a compromise is to limit the pruning, at each iteration, to a single edge from each node that is above the threshold (giving the chance for its $\delta'_i$ value to be updated before the next pruning).
+
+[Figure 19 image: rows for $t = 0$ and $t = 2$; columns: volume ratios (histogram of $\delta'_i$ versus node count, with threshold), individual scales ($\sigma^{(0)}$, $\sigma^\star$), and edge sparsification ($G^{(1)}$, converged $G^\star$).]
+
+Figure 19: Optimal scales and associated normalized volume ratios at iterations 0 and 2 of Algorithm 1 on the horseshoe dataset (see Figure 13). Top row: the $\delta'_i$ statistic has median around 1 and several outliers. These are caused by the long edges and huge scales (middle). Right column: $G^{(t)}$ after iteration 1, with edges deleted shown in red (top), and after iteration 2 (bottom).
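+
+The thresholding rule of Remark 13 can be illustrated with a short calculation. The exact quartile-based estimators are those of [105] and are not reproduced in this section, so the sketch below instead assumes simple normal-theory approximations (mean taken as the average of the three quartiles, standard deviation taken as the interquartile range divided by 1.349). These may differ from the C3 formulas, particularly for small samples. The quartile values $q_1 = 0.9$, median $= 1.0$, and $q_3 = 1.1$ are hypothetical.
+
+$$
+\begin{aligned}
+\hat{\mu} &\approx \frac{q_1 + \mathrm{median} + q_3}{3} = \frac{0.9 + 1.0 + 1.1}{3} = 1.0, \\
+\hat{s} &\approx \frac{q_3 - q_1}{1.349} = \frac{0.2}{1.349} \approx 0.148, \\
+\text{threshold} &= \hat{\mu} + 4.5\,\hat{s} \approx 1.0 + 4.5 \times 0.148 \approx 1.667.
+\end{aligned}
+$$
+
+Under these assumed numbers, a node with $\delta'_i = 2.0$ would exceed the threshold and lose the edge to its farthest neighbor, while a node with $\delta'_i = 1.3$ would be left untouched. Because the estimates come from quartiles, a handful of very large $\delta'_i$ values does not inflate the threshold the way a plain sample standard deviation would.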
+
+3.5.3 Convergence 
+
+The algorithm converges at iteration t when no point i has an outlying $\delta'^{(t)}_i$ (i.e., greater than a statistical threshold). This implies that no edges will be pruned, so G(t+1) = G(t) and therefore no further changes can occur to either σ(t) or G(t). Note that convergence is guaranteed: since at every iteration t an edge must be removed, the algorithm necessarily reaches a certain t at which all outliers (if there were any to begin with) have been pruned. If one is solely interested in obtaining G⋆ (i.e., not interested in G⋆), an alternative convergence condition may be adopted that looks at the distribution of the (normalized) multiscale volume ratio, $\delta'^{(t)}_{i\mathrm{MS}}$:
+
+$$
+\delta'^{(t)}_{i\mathrm{MS}} \equiv \frac{\sum_j w_{ij}}{\sum_j a_{ij}} \left( \frac{2}{\sqrt{\pi}} \right)^{\tilde{d}_i}, \tag{40}
+$$
+
+analogous to equation 37 but using the weights from G(t) directly. Since the multiscale kernel takes into account the interaction between individual scales, the distribution of $\delta'_{i\mathrm{MS}}$ will typically be tighter than that of $\delta'_i$ (i.e., some of the excessively large scales might be compensated by small neighboring scales). Therefore, one may wish to allow for an earlier convergence when there are no remaining outliers in the distribution of $\delta'_{i\mathrm{MS}}$.
+
+Finally, in applications where it is required that G⋆ be connected, pruning can simply be stopped before disconnection. Naturally, G⋆ is always connected up to machine precision or some numerical tolerance.
+
+In closing this section, we return to one of our introductory examples and show, in Figure 23, the resulting graphs for the sampled Swiss cheese patterns (from Figure 2). When sampling is too sparse (bottom), there is only so much that can be inferred, and not all holes are free of edges after convergence. As sampling gets denser, however, the algorithm correctly identifies that edges across holes should be pruned (middle). When it is very dense (top), even the initial Gabriel graph is able to correctly infer the true holes.
+
+[Figure 20 panels, for t = 0, 3 and 6: histogram of volume ratios $\delta'_i$ (node count, with threshold marked); individual scales; edge sparsification, G(2), G(5) and converged G⋆.]
+
+Figure 20: Optimal scales and associated normalized volume ratios at different iterations of Algorithm 1 on the dataset from Figure 6. The distribution of $\delta'_i$ (left) indicates those connections that are least likely to represent reasonable geodesics over the underlying manifold. Right column shows G(t) after iterations 2, 5 and 6 (deleted edges in red).
+
+[Figure 21 panels, for t = 0, 3 and 7: histogram of volume ratios $\delta'_i$ (node count, with threshold marked); individual scales; edge sparsification, G(2), G(6) and converged G⋆.]
+
+Figure 21: Optimal scales and associated volume ratio statistics at different iterations of Algorithm 1 on the clustered dataset from Figure 6. Pruned edges (in red) are precisely those connecting the three clusters together.
+
+[Figure 22 panels, for t = 0, 137 and 274: histogram of volume ratios $\delta'_i$ (node count, with threshold marked); individual scales; edge sparsification, G(136), G(273) and converged G⋆.]
+
+Figure 22: Optimal scales and associated normalized volume ratios $\delta'_i$ after each iteration of the algorithm on the dataset from Figure 28 (here, seen from a lateral view). The number of initial connections in G(0) (Gabriel graph) is very large, so the initial distribution of $\delta'_i$ shows two modes. However, ratios in the right-side peak are very high, and are therefore easily identified as outliers. The algorithm converges soon after all edges crossing the gap are eliminated.
+
+Figure 23: Sampled Swiss cheese results (cf. Figure 2). The original sampled points (true holes outlined) are shown, together with the converged graphs. In the sparse case (bottom), sampling is close to locally uniform so not all holes are correctly inferred. As sampling gets denser (top two rows), no holes are violated. 
+
+## 3.6 Comparison with other kernel methods 
+
+We now compare the data graphs obtained using our iterated adaptive neighborhoods (IAN) with those from other popular manifold learning methods. In Figure 24, a synthetic “stingray” dataset exhibits a transition of apparent dimension from 2 (body) to 1 (tail), a variation of the scenario explored in Figure 4. Points were uniformly sampled, with 20% deleted at random.
+
+Our converged, unweighted graph, G⋆ (top row in Figure 24), can be compared with the traditional k-nearest neighbors graph (bottom row), used in a variety of methods, including Isomap [96]. In the latter, when k = 2, the tail exhibits perfect connectivity, but the body is too sparse. If k = 4, the body is more properly connected but the tail becomes overly connected, and “folding”, or “short-circuits”, start to appear. Finally, for k ≥ 8, the connectivity is inappropriate as the tip of the tail connects directly to the body. In contrast, G⋆ manages to retain a minimally-connected tail while covering the body almost everywhere, creating appropriate edges across many of the sampling gaps (compare with the holes that remain in the k-NN graph with k = 4, some of which are present even when k = 8).
+
+Our weighted graph, G⋆, can be compared against methods that use a Gaussian-like kernel, and where each point has an individual scale. Some of these methods were described in section 2.1: t-SNE [101], UMAP [77], self-tuning [111], and variable bandwidth [17, 16]; their resulting connectivity can be visualized in Figure 24, where edges have intensity proportional to their weight.
+
+In Figure 25, we visualize the individual scales resulting from each of these methods. Each σi is represented, around each point i, as the level set corresponding to a (single-scale) kernel value of 0.75. At the top, we see that the scales found by our kernel seem to nicely conform to the space between each point and its neighbors. Especially illuminating is what happens along the tail, where scales either “expand” or “shrink” so as to minimally cover the spaces between neighboring points; this illustrates what our scale optimization achieves. Among the other methods, with few exceptions, the scales seem to cover either too much (collapsing the tail on itself) or too little (leaving holes in the body).
+
+The weighted graphs in Figure 24 reveal the result of the interaction between these individual scales (namely, the edge weights). Our weighted G⋆ (top right) manages to cover almost the entire body with edges while keeping the tail minimally connected; in fact, it resembles the unweighted G⋆, and therefore respects the original curvature and reach. Other methods, in contrast, have a hard time achieving both things with a global value for k. In t-SNE, the scales over the body are much too small when k ≤ 4, so its weighted graph looks too sparse; for k ≥ 8, the scales over the tail become too large, and therefore strong edges appear, connecting it to the body. In UMAP, the scales do not grow as much with increasing k, but at k = 4 the body in the weighted graph is still too sparse, while for k ≥ 8 the tail is strongly connected to the body. With self-tuning, scales seem to grow faster with k, while with variable bandwidth this growth is somewhat counteracted by the action of its global scale (equation 7). In fact, the graph that most resembles our own G⋆ is the one using the variable bandwidth kernel with k = 2, the main difference being that the big sampling gap near the tip of the body is poorly connected, while in our case it is slightly overly connected (due to connections in G⋆ crossing that gap).
+
+# 4 Applications 
+
+We now provide examples of application of our kernel to three different manifold learning tasks: dimensionality reduction, by means of a non-linear embedding algorithm; geodesic estimation, which typically finds application in computational geometry, vision, and graphics; and local intrinsic dimensionality estimation.
+
+Figure 24: Top: The stingray dataset and the converged graphs G⋆ (unweighted and weighted); pruned edges are shown in red. Bottom: Other algorithms produce qualitatively different graphs depending on the neighborhood size parameter, k. All graphs shown are weighted (using a continuous kernel) except for the k-nearest neighbors graph (bottom row). Edge weights are visualized as the intensity of the line segments (each wij is divided by the kernel value when rij equals the scale, for a fair comparison across algorithms).
+
+Figure 25: Individual scales obtained using our algorithm (top) compared to other methods (bottom table), as represented by their level sets for a (single-scale) kernel value of 0.75.
+
+## 4.1 Low-dimensional embeddings 
+
+Dimensionality reduction is now ubiquitous in visualization of high-dimensional data. Several methods exist [92, 69, 50, 102], and most of the non-linear methods are manifold-based [88, 96, 87, 57, 10, 36, 112, 28, 109, 101, 95, 77, 80]. Given a collection of points in high-dimensional space sampled from a low-dimensional manifold M, the goal is to find a good parametrization for the data in terms of intrinsic coordinates over M, which in turn can be used to produce a low-dimensional embedding.
+
+In surveying the literature, it is common to find a heuristic, or a range of values, suggested for choosing the neighborhood size (see section 2.1), but rarely do we see examples of the sensitivity of the results to that choice. In this section, we ran a few of the most popular methods using a wide range of values for the kernel scale parameter, k, and compared their results to those using our own kernel. We have limited our comparison to some of the embedding methods that use a neighborhood kernel and for which pairwise information is sufficient as input (i.e., do not require positional information): diffusion maps [29, 28], Isomap [96], t-SNE [101], and UMAP [77]. As shown in Figures 26–28, results can vary qualitatively depending on the choice of k. Five values were tested for each dataset, spanning a wide range of scales and different geometries. Next, we summarize each of these methods and their results.
+
+### 4.1.1 Diffusion maps + self-tuning kernel
+
+Diffusion maps are based on the spectral properties of the random walk matrix (normalized graph Laplacian) over the weighted data graph; integration over all paths in the graph makes diffusion distances, in principle, more robust to “short-circuiting” than graph geodesics. For better comparison with IAN, instead of the standard single-scale Gaussian kernel we use the self-tuning approach of Zelnik-Manor and Perona [111] from equation 6. Our kernel was applied to diffusion maps by directly using the weighted G⋆ as similarity matrix (weighted adjacency matrix). We use the diffusion map parameters α = 1 and t = 1 [cf. 28].
+
+With the stingray dataset (Figure 26), we see that the fully-extended tail at k = 2 becomes progressively more folded and compressed as k increases. The body appears contracted at k = 2, but expands with larger k. Using our own G⋆, although we obtain excellent embeddings of both body and tail (right-most column), they are represented by separate sets of coordinates (two for the body, and a third for the tail), which happens due to the change in dimensionality. Applying self-tuning to the spiral dataset (Figure 27), only k = 2 and k = 4 were able to prevent folding. The bent plane (Figure 28) was more tolerant, with good results for all k except 64, for which the plane remained folded. When using IAN, a good parametrization was obtained for both datasets.
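+
+For orientation, the standard diffusion-map construction of Coifman and Lafon, as it is commonly written, is sketched below with the parameters α and t mentioned above. The symbols $q_i$, $W^{(\alpha)}$, $P$, $\lambda_\ell$ and $\psi_\ell$ are notation assumed here for illustration and are not taken from the source; $W$ stands for whichever weighted adjacency matrix is supplied (self-tuning or IAN):
+
+$$
+q_i = \sum_j W_{ij}, \qquad
+W^{(\alpha)}_{ij} = \frac{W_{ij}}{q_i^{\alpha}\, q_j^{\alpha}}, \qquad
+P_{ij} = \frac{W^{(\alpha)}_{ij}}{\sum_r W^{(\alpha)}_{ir}}, \qquad
+\Psi_t(x_i) = \left( \lambda_1^{t}\,\psi_1(i), \ldots, \lambda_m^{t}\,\psi_m(i) \right),
+$$
+
+where $P\psi_\ell = \lambda_\ell \psi_\ell$ with $1 = \lambda_0 \ge |\lambda_1| \ge \cdots$, and the constant eigenvector $\psi_0$ is discarded. In this formulation, α = 1 is the density normalization that, in the continuum limit, removes the influence of nonuniform sampling density on the limiting operator, and t = 1 corresponds to a single step of the random walk.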
+
+### 4.1.2 Variable bandwidth diffusion embedding
+
+We also tested a variant of diffusion maps using the variable bandwidth kernel of Berry et al. [17], in which a distinct type of multiscale kernel is proposed, along with a specific normalization of the weighted graph Laplacian. Because it computes an additional global scale based on the individual scales, in order to apply our algorithm to this method we replaced the density estimates, q (equation 7), with the inverse of our optimal scales. We used α = 0 and β = −1/2, as recommended in [16]; eigenvectors were scaled by the square root of the inverse of their respective eigenvalues [90, 84], following the implementation in [8].
+
+This method produced good embeddings for the stingray, especially for k = 8 (Figure 26). For the spiral (Figure 27), using k ≤ 8 caused some points to drift apart, and although it returned basically the original curve when k = 16 or 32, a spectral algorithm such as this is expected to “unroll” the spiral, finding a good (1-D) parametrization of it. The same happened with the bent plane (Figure 28), which could not be embedded into 2 coordinates for any choice of k. Using our scales, however, the algorithm managed to find appropriate parametrizations for all three datasets.
+
+Figure 26: Running different embedding algorithms on the “stingray” dataset (see Figure 24). Different choices of the neighborhood size, k, may produce qualitatively different results, depending on the algorithm. Running those same algorithms using the IAN kernel (right) typically gives a reasonable result. Refer to main text for details.
+
+### 4.1.3 Isomap
+
+Isomap applies classical multidimensional scaling (MDS) to geodesic distances computed as shortest paths over a k-nearest neighbors graph (equation 2). Because the graph is unweighted, this method is particularly sensitive to the choice of k. Our kernel was applied to Isomap by directly replacing the k-NN graph with G⋆.
+
+With the stingray (Figure 26), Isomap produced a good embedding with k = 4. The result with k = 2 was completely wrong (an additional tail appears), and with k = 8 the tip of the tail was disconnected. With k = 16 and k = 32, it essentially returned the original data, without any dimensionality reduction. Our G⋆ improved on the result of k = 4 by making the points in the body more uniformly spread. The spiral (Figure 27) was properly embedded (1-dimensional) only when k ≤ 4. With the bent plane (Figure 28), good results were obtained for k between 4 and 16, but k = 2 produced 1-dimensional curves, and k = 64 did not completely unfold it. Our G⋆ produced the correct mapping in either case.
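+
+For reference, classical MDS as it is usually stated (a standard textbook formulation; the symbols $D^{(2)}$, $J$, $B$ and $Y$ are notation assumed here rather than taken from the source) turns the N × N matrix of squared graph-geodesic distances into coordinates by double centering followed by an eigendecomposition:
+
+$$
+J = I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^{\top}, \qquad
+B = -\tfrac{1}{2}\, J D^{(2)} J, \qquad
+B = V \Lambda V^{\top}, \qquad
+Y = V_m \Lambda_m^{1/2},
+$$
+
+where $D^{(2)}_{ij}$ is the squared shortest-path distance between points i and j, and $V_m$, $\Lambda_m$ keep the m largest eigenvalues and their eigenvectors. Under this reading, swapping the k-NN graph for G⋆ changes only the shortest paths that enter $D^{(2)}$; the MDS step itself is unchanged.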
+
+### 4.1.4 t-SNE and UMAP
+
+t-SNE and UMAP are related methods that have gained popularity in recent years [9]. Both compute similarities between data points using individual scales based on log₂ k (section 2.1), and adopt a secondary kernel for computing similarities between embedded points: t-SNE uses a Student t-distribution (Cauchy kernel), while UMAP uses a non-normalized variant requiring a hyperparameter, min_dist. In t-SNE, embedding coordinates are initialized at random, while UMAP adopts the strategy of refining an initial spectral embedding. Both then optimize their embeddings by running gradient descent on an information-theoretic cost function between similarities in input space vs. embedded space: t-SNE minimizes the KL-divergence; UMAP uses a variant of cross-entropy.
+
+Alternative initializations are typically used with t-SNE (e.g., PCA) to improve results [63, 72, 64]; in our experiments, for better comparison with UMAP, we used a spectral embedding initialization computed from its own symmetrized similarity matrix (equation 10). The IAN kernel was applied to t-SNE by replacing the individual scales (equation 8) with those in σ⋆; with UMAP, because a different kernel function is used, we directly replaced the weighted graph (with adjacencies given by Uij in equation 12) with G⋆.
+
+We executed t-SNE assigning the various k values to the perplexity parameter, leaving the remaining parameters to their defaults in the scikit-learn implementation [85]. We used the Barnes-Hut method [100] for the cylinder dataset, and the “exact” method for all others. In UMAP, the n_neighbors parameter was set to k, with remaining parameters using default values (in particular, min_dist = 0.1). Because of the stochastic nature of both algorithms (even when using a fixed initialization), different runs will produce slightly different results. Therefore, in order to avoid “cherry-picking”, both algorithms were executed a single time, using the same random seed.
+
+Results for the stingray (Figure 26) were quite analogous between the two algorithms: both produced artificial clustering for k ≤ 8, while for k ≥ 16 the tail began to fuse with the body. The gaps in sampling within the body were accentuated by both algorithms, even at k = 32, where we see a big hole in the UMAP embedding; in t-SNE, it almost breaks into two pieces (despite the large neighborhood size). This example is illustrative of how much an embedding algorithm based on attractive vs. repulsive forces can end up exaggerating nonuniform sampling.
+
+The spiral (Figure 27) was disconnected by t-SNE for all values of k except 8. UMAP produced reasonable results for k between 4 and 8; however, for k = 2 a multitude of clusters was obtained, and when k ≥ 16 the curve twisted over itself. Using our kernel (right column) produced a connected, non-self-intersecting curve. Neither algorithm was capable of returning a good arc-length parametrization of the spiral, however.
+
+With the bent plane (Figure 28), although both algorithms succeeded in unfolding it, t-SNE was only able to produce a fully two-dimensional plane (with no gaps) when setting k = 32 (not shown) or 64, while UMAP required k ≥ 16. Both gave reasonable results using our kernel.
+
+Figure 27: Running different embedding algorithms on the spiral dataset (top), in which points are sampled from a unit-speed parametrized Archimedean spiral. Different choices of the neighborhood size, k, may produce qualitatively different results, depending on the algorithm. Running those same algorithms using the IAN kernel (right) typically gives a reasonable result. Refer to main text for details.
+
+### 4.1.5 A higher dimensional example
+
+Because all of the examples above have d ≤ 2, we also tested our kernel when applied to a higher dimensional manifold, namely a 5-dimensional cylinder ($\mathbb{R}^1 \times S^4$) with radius 1 and length 3, sampled uniformly at random (N = 8403, ambient space $\mathbb{R}^6$). On the other hand, here we used a pure, connected manifold with no bottlenecks and low curvature in order to simplify interpretation.
+
+Figure 28: Running different embedding algorithms on the bent plane dataset (top), generated by extending a unit-speed parametrized catenary curve into two dimensions. Different choices of the neighborhood size, k, may produce qualitatively different results, depending on the algorithm. Running those same algorithms using the IAN kernel (right) typically gives a reasonable result. Refer to main text for details.
+
+Figure 29 shows two-dimensional embeddings obtained by applying our kernel to different embedding algorithms. Although all correctly produced an oblong, various degrees of mixing of the original color labels were observed, which can be used to qualitatively indicate the quality of the embedding schemes. A quantitative assessment was computed as the rank correlation coefficient, or Kendall’s tau [60, 62], between the ranking (positional order) of each point along the main axis in the original vs. embedded spaces.
+
+Both t-SNE and UMAP produced similar or better results when using the IAN kernel (we set k = 27 based on the mean degree found in G⋆, compatible with d = 5; results were robust to this particular choice). Despite their current popularity [e.g., 108, 7, 27, 35, 9, 63, 46, 64, 107], both produced considerably jittered outputs, however, implying that the original neighborhoods were not preserved. This appears to be caused by an attempt to reproduce the spherical shape of the cylinder’s base along the main axis, so different “slices” ended up projected on top of one another. However, UMAP produced jittered results even when set to return 6 components (as in the original space) instead of 2.
+
+Diffusion maps using IAN resulted in little mixing except near the boundaries, so neighborhoods were better preserved. Running it with either self-tuning or variable-bandwidth kernels using k = 27 gave comparable results; Isomap also produced excellent results, with tau = 0.98 (not shown).
+
+Figure 29: Performance of different embedding algorithms on a 5-dimensional cylinder ($\mathbb{R}^1 \times S^4$) sampled uniformly at random (N = 8403, ambient space $\mathbb{R}^6$). Top left: original data, X, projected onto the first 2 coordinates (points colored according to their position along the cylinder’s long axis). Other plots show embeddings using different kernels and/or algorithms. The resulting degree of mixing of the original color labels indicates the quality of the embedding. A quantitative assessment (plots to the left of each embedding) was computed as the rank correlation coefficient, tau (see main text), between the ranking (positional order) of each point along the horizontal axis in the original vs. embedded spaces (a value closer to 1 indicates fewer exchanges in the original order). Use of the IAN kernel produced similar or better results with both t-SNE and UMAP (k = 27 was set based on the mean degree of G⋆, compatible with d = 5). Diffusion maps resulted in very little mixing except near the boundaries.
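+
+For reference, Kendall's tau in its simplest form (tau-a, which assumes no tied ranks; the counts $n_c$ and $n_d$ are notation introduced here, and the source does not say which variant was computed) compares every pair of points and asks whether their order along the axis is the same in both spaces:
+
+$$
+\tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n (n - 1)},
+$$
+
+where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs among the $n(n-1)/2$ pairs of points. A value of τ = 1 means the positional order is fully preserved, and each pair whose order is exchanged lowers τ by $4 / (n(n-1))$.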
+
+## 4.2 Geodesic computation 
+
+Using the unweighted graph, G⋆, one may immediately compute graph geodesics (shortest paths using distances in ambient space as edge lengths) to estimate the true geodesics over M [96]. The latter are likely to be underestimated by the former when sampling is sparse [15], even when the graph connectivity is correct, e.g., due to curvature (cf. section 3.3.2). It seems a good idea, then, to incorporate the continuous kernel values present in its weighted counterpart, G⋆, as a means to possibly improve geodesic estimation.
+
+We propose to use the heat method for geodesic computation of Crane et al. [32]. It consists in solving the Poisson equation to find a function, φ, whose gradient follows a unit vector field, X, pointing along geodesics; X can be obtained by normalizing the temperature gradient, ∇u, due to a diffusion process in which heat, u, is allowed to diffuse for a short time. Although this method is tailored to applications where positional information and dimensionality are known (in particular, surfaces in $\mathbb{R}^3$), here we apply it to the weighted G⋆, since discrete versions of the operators used (Laplacian, gradient, and divergence) can be readily defined on a weighted graph [see 34].
+
+Despite using pairwise information only, our method produces reasonable estimates, as shown in Figures 30 and 31. To understand why, notice that IAN indirectly solves for a weighted graph for which a random walk starting at node i has a higher probability of reaching a node in its discrete neighborhood, $\mathcal{N}(i)$, than any other non-neighboring node. Given that random walks are closely related to diffusion over a graph, one should expect G⋆ to be able to provide reasonable information about how a diffusion process propagates over M. In other words, the Laplacian obtained from G⋆ should be a good approximation of a continuous operator over M; this is empirically confirmed by our results.
+
+In Figure 30, heat geodesics computed from the weighted G⋆ for the bent plane dataset approximate well the true geodesics over M, and graph geodesics obtained from the unweighted G⋆ follow closely. Comparison with those from a naive k-NN graph illustrates that the choice of k is critical (compare with the bottom row of Figure 28). In Figure 31, we compare the results using weighted graphs from various kernels on the stingray dataset; interestingly, heat geodesics computed from G⋆ hold reasonably well even when facing a continuous change in dimensionality. (The diffusion time parameter used by the heat method was optimized for each dataset.)
+
+Figure 30: Geodesic estimation for the bent plane from Figure 28; yellow points are closer to the source (marked with an arrow in the ground truth plot). Top: different views of the data in 3-D, with points colored according to the heat geodesics computed from G⋆. Middle: Geodesics displayed on an unbent version of the dataset: heat geodesics approximate well the true geodesics over M, and graph geodesics computed from G⋆ follow closely. Bottom: graph geodesics computed from k-NN graphs using different choices of k; choosing k = 16 gives near-perfect results, but k = 4 shows distortions, and k = 66 misses completely.
+
+Figure 31: Geodesics estimated using the heat method applied to G⋆ are close to the ground truth (top). Other kernels yield suboptimal results for most choices of k (bottom); in particular, notice how the tip of the tail is usually inferred to be closer than it should (due to its being directly connected to the body in the underlying graph, cf. Figure 24). Yellow points are closer to the source (marked with an arrow in the ground truth plot).
+
+## 4.3 Local dimensionality estimation
+
+Intrinsic dimensionality (ID) estimation is tightly associated with dimensionality reduction tasks, especially in manifold learning, where knowledge of d can help, among other things, to determine the appropriate number of embedding dimensions. Informally, ID may be seen as the minimum number of parameters required to accurately describe the data. In the context of manifold learning, it is typically equivalent to the topological dimension of M (e.g., a general space curve has dimensionality 1 since it requires a single parameter, arc length).
+
+There are many different ways to estimate it [24, 25]; global approaches are typically divided into two groups. The first group is based on some variant of PCA [e.g., 47, 73], and uses the number of significant eigenvalues to infer dimensionality; these may be applied globally or by combining local estimates. The second group of methods, termed geometric (or fractal, when a non-integer ID is computed), exploits the geometric relationships in the data, such as neighboring distances. Some are based on estimating packing numbers [58] or on distances to nearest neighbors [99, 86, 104, 31, 39, 18]. Among the most popular are the correlation dimension methods [26, 51, 55], a variant of which has been specifically applied in the context of determining an appropriate kernel width for manifold learning [see 30, 17, 52]. The dimension is computed as the slope of a log-log plot of the number of neighboring points vs. neighborhood radius (see section 2.1). A recent variation is [61]; others cover the difficult case of high ID [26, 89].
+
+In our scenario, since we do not assume a pure manifold (section 3.1), we focus on local (i.e., pointwise) ID estimation approaches, namely those in which dimension is estimated within a neighborhood around each data point [e.g., 40, 54]. This notion can be formalized as the local Hausdorff dimension [110, 25], and a global estimate is typically found by averaging over local values.
+
+A popular approach is the maximum likelihood estimator (MLE) of Levina and Bickel [70], which computes local dimension based on k-nearest neighbors:
+
+$$
+\hat{m}_k(x_i) = \left( \frac{1}{k-1} \sum_{j=1}^{k} \log \frac{T_k(x_i)}{T_j(x_i)} \right)^{-1}, \tag{41}
+$$
+
+where $T_j(x_i)$ denotes the distance between $x_i$ and its jth nearest neighbor. We shall use this method in our experiments, in which we compute a final $m_k(x_i)$ by averaging $\hat{m}_k(x_i)$ over i’s neighbors in order to reduce the variance of the local estimates (in the original, this is done over all data points). Notice that our kernel can be readily used with this method by simply replacing the k-NN graph with G⋆, therefore summing over nodes in the neighborhood $\mathcal{N}(i)$ instead of over the k nearest. Additionally, we propose a correlation dimension-based method that allows for local estimates. We describe it next, then compare its results with those from the MLE method.
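+
+As a sanity check on the Levina and Bickel estimator (inverse of the mean log-ratio, natural logarithms), consider an idealized neighborhood, assumed here purely for illustration and not taken from the source, in which points are spread with uniform density over a d-dimensional region, so that the distance to the jth neighbor grows as $T_j(x_i) \propto j^{1/d}$. Each log-ratio then scales as 1/d, and the estimator returns d times a factor that depends only on k:
+
+$$
+\log \frac{T_k(x_i)}{T_j(x_i)} = \frac{1}{d} \log \frac{k}{j}
+\quad \Longrightarrow \quad
+\hat{m}_k(x_i) = d \left( \frac{1}{k-1} \sum_{j=1}^{k} \log \frac{k}{j} \right)^{-1}
+= d \, \frac{k-1}{k \log k - \log k!} .
+$$
+
+For k = 4 the factor is $3 / (4 \log 4 - \log 24) \approx 3 / 2.367 \approx 1.27$; for k = 100 it is about 1.02, and by Stirling's approximation it tends to 1 as k grows. In this idealized setting the estimate is therefore inflated for small neighborhoods and approaches d as more neighbors are included.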
+
+### 4.3.1 Algorithm: Neighborhood Correlation Dimension
+
+Our proposed method is adapted from the approach of Hein and Audibert [55] [also used in 30, 52, 17], where an estimate of correlation dimension is obtained using a general kernel. It consists in computing a curve, Z(σ), over all pairwise kernel values (e.g., a Gaussian) at different values of the scale parameter σ:
+
+$$
+Z = \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right). \tag{42}
+$$
+
+As in [30] (and analogous to equations 29–31), by assuming that for small values of σ the manifold M looks locally like its tangent space, $\mathbb{R}^d$, we have
+
+$$
+Z \approx \frac{N^2 \left( \sqrt{2\pi}\, \sigma \right)^d}{\mathrm{vol}(M)}, \tag{43}
+$$
+
+which, after taking the logarithm, yields
+
+$$
+\log Z \approx d \log \sigma + \log \frac{N^2 (2\pi)^{d/2}}{\mathrm{vol}(M)}, \tag{44}
+$$
+
+so the slope of the log Z versus log σ curve can be used to estimate the global dimensionality of the manifold, d. To do so, one typically looks for a region where this slope is most stable, i.e., the curve is approximately linear. Automated ways of finding the slope of such a region are: by linear regression of the middle portion of the curve [55] or by taking a point of maximum of Z′(σ) [17, 52].
+
+However, because we assume that intrinsic dimension may vary over M, global averages cannot work in general. Moreover, nonuniform density, curvature, or multiple connected components may all create multiple peaks for Z′(σ), so inspection of the log-log plot cannot be automated. Therefore, we modify this approach to use individual $Z_i(\sigma)$ curves for each data point $x_i$. To keep the summation local, points are restricted to those in the neighborhood of i in G⋆. Here, it is advantageous to work with an extended neighborhood (e.g., by also including neighbors-of-neighbors) due to the theoretical limit to the value of the dimension d that can be accurately estimated given a set of N points [38], namely $d < 2 \log_{10} N$. In fact, if N is large compared to d, even additional hops away from i may be considered. Because such extension is done by following edges in G⋆ (as opposed to naively expanding a ball in $\mathbb{R}^n$), we may thus obtain a larger (approximately tubular) neighborhood around $x_i$ without ever leaving the manifold. We denote such a neighborhood $\mathcal{N}'(i)$, as opposed to the immediate neighborhood $\mathcal{N}(i)$; throughout this section, both will include the node i itself.
+
+Our algorithm involves the following steps:
+
+1. For each data point $x_i$ and its extended neighborhood, $\mathcal{N}'(i)$, define $Z_i$ as
+
+$$
+Z_i(\sigma) = \sum_{j \in \mathcal{N}'(i)} \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right). \tag{45}
+$$
+
+2. Analogous to equation 44, by taking the logarithm we have that the slope of the log $Z_i$ versus log σ curve, i.e.,
+
+$$
+Z'_i(\sigma) \overset{\mathrm{def}}{=} \frac{d \log Z_i}{d \log \sigma}, \tag{46}
+$$
+
+is an estimate of $d_i$, the dimension around $x_i$, as a function of σ. Computationally, it is desirable to use the closed-form expression, for accuracy (it follows from differentiating equation 45 with respect to σ and multiplying by $\sigma / Z_i$):
+
+$$
+Z'_i(\sigma) = \frac{\sum_{j \in \mathcal{N}'(i)} \| x_i - x_j \|^2 \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right)}{\sigma^2 \sum_{j \in \mathcal{N}'(i)} \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right)}. \tag{47}
+$$
+
+3. A region of stability of $Z'_i$, i.e., a local maximum, is then an estimate of the dimension around $x_i$.
+
+A local maximum (“peak”) in $Z'_i(\sigma)$ can be interpreted as follows: as a ball around $x_i$ is expanded, the rate at which neighbors are seen has stopped increasing and must decrease with larger σ, since no additional neighbors can be found after the ball encompasses all points in $\mathcal{N}'(i)$. Underlying is the assumption that $\mathcal{N}'(i)$ is sufficiently representative of the manifold around $x_i$. That is, if neighbors are approximately uniformly distributed and dimensionality is constant within it, then $Z'_i$ should remain constant over some appreciable range of σ, whence the notion of “stability”.
+
+Even though we work with a subset of X, there may still be multiple maxima in $Z'_i$, e.g., when the neighbors of $x_i$ are far from uniformly distributed around it. So, operationally, we use the global maximum of $Z'_i$, as this takes into account the information given by the majority of neighboring points. Now, because $Z_i \to 1$ as $\sigma \to 0$, and $Z_i \to |\mathcal{N}'(i)|$ as $\sigma \to \infty$, the slope of log $Z_i$ must approach 0 at both extremes, thus the global maximum of $Z'_i$ must also be a relative one (a “peak”).
+
+We now proceed to avoid boundary effects by re-centering neighborhoods. The boundary, ∂M, of a d-dimensional manifold (when present) has dimensionality d − 1 [68]. The correlation integral approach often fails for these: it typically returns d/2 for points in ∂M, since they have roughly half the number of neighbors compared to interior points. For the same reason, it tends to also underestimate d for points near the boundary. Since we work locally over a graph, we can regularize the computation by moving the focus to a more central, nearby point (thus regularizing over sampling artifacts as well):
+
+4. Letting $\mathcal{N}(i)$ be the set of adjacent nodes to i in G⋆ and including i itself, define $\bar{\iota}$ as the node $j \in \mathcal{N}(i)$ with smallest median squared distance to all points in the extended neighborhood $\mathcal{N}'(i)$:
+
+$$
+\bar{\iota} = \operatorname*{argmin}_{j \in \mathcal{N}(i)} \operatorname{median} \left\{ \| x_j - x_l \|^2, \ \forall l \in \mathcal{N}'(i) \right\}. \tag{48}
+$$
+
+Thus $\bar{\iota}$ is, in effect, the most central node in i’s immediate neighborhood.⁸
+
+5. Use $\bar{\iota}$ as the point from which kernel values are computed for $Z_i(\sigma)$ by replacing $x_i$ with $x_{\bar{\iota}}$ in equation 45, thereby shifting the center of estimation of $d_i$. This assumes that the dimension does not change abruptly across neighboring points. Denote the resulting estimate by $\hat{d}_i$.
+
+6. As with the MLE method (section 4.3), we may obtain a smoother estimate, $\hat{d}'_i$, by averaging over immediate neighbors in $\mathcal{N}(i)$:
+
+$$
+\hat{d}'_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \hat{d}_j. \tag{49}
+$$
+
+Finally, recall from section 3.5 that we also obtain a degree-based estimate, $\tilde{d}_i$, when computing volume ratios (equation 36); we can use this information to further improve our results. A final estimate, $d^\star_i$, is then obtained as follows:
+
+> 8 Since we know $G^\star$, graph-theoretical quantities such as shortest-path betweenness centrality [45, 22] may also be used here.
+
+7. To avoid overestimating the true dimension, compute an average $\tilde{d}_i'$ over $\mathcal{N}(i)$ as
+
+   $$\tilde{d}_i' = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \lfloor \tilde{d}_j \rfloor = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \lfloor \log_2 \deg(j) \rfloor. \tag{50}$$
+
+8. Compute the optimal estimate, $d_i^\star$, as
+
+   $$d_i^\star = \max\left\{\hat{d}_i',\ \tilde{d}_i'\right\}. \tag{51}$$
+
+Application of this technique and comparison with other methods are given next. 
+
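+The closed form in equation 47 follows from equations 45 and 46 by the chain rule, a step the text leaves implicit. A short sketch of that derivation, writing $r_{ij} = \|x_i - x_j\|$ for the Euclidean distance (the shorthand used in Appendix A) and assuming every sum runs over $j \in \mathcal{N}'(i)$:
+
+```latex
+% Shorthand: r_{ij} = \|x_i - x_j\|; all sums run over j \in \mathcal{N}'(i)
+\begin{aligned}
+Z_i(\sigma) &= \sum_{j} \exp\!\left(-\frac{r_{ij}^2}{2\sigma^2}\right)
+  && \text{(equation 45)} \\
+\frac{dZ_i}{d\sigma} &= \sum_{j} \frac{r_{ij}^2}{\sigma^3}
+  \exp\!\left(-\frac{r_{ij}^2}{2\sigma^2}\right)
+  && \text{(differentiate each term)} \\
+Z_i'(\sigma) = \frac{d\log Z_i}{d\log\sigma}
+  &= \frac{\sigma}{Z_i(\sigma)}\,\frac{dZ_i}{d\sigma}
+  && \text{(chain rule applied to equation 46)} \\
+  &= \frac{\sum_{j} r_{ij}^2 \exp\!\left(-\frac{r_{ij}^2}{2\sigma^2}\right)}
+          {\sigma^2 \sum_{j} \exp\!\left(-\frac{r_{ij}^2}{2\sigma^2}\right)}
+  && \text{(equation 47)}
+\end{aligned}
+```
+
+Read this way, equation 47 is a Gaussian-weighted average of $r_{ij}^2/\sigma^2$ over the neighborhood. Because $\mathcal{N}'(i)$ contains $i$ itself (with $r_{ii} = 0$), every term of the numerator tends to 0 as $\sigma \to 0$ while the denominator tends to 1; as $\sigma \to \infty$ the weights tend to 1 and the $1/\sigma^2$ factor drives the ratio to 0. Both limits agree with the statement that the slope approaches 0 at the two extremes.
+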
+4.3.2 Experimental results 
+
+Results of applying our neighborhood correlation dimension (NCD) algorithm compared to Levina & Bickel’s MLE estimator (equation 41) are shown in Figures 32–34. For NCD, we compared results using IAN against those from k-NN graphs using various values of k (a range was chosen that included the best results for each algorithm). The IAN kernel was applied by using the discrete neighborhoods of $G^\star$, re-centered using neighbors-of-neighbors at most 3 hops away from i (equation 48). Using IAN, we obtained near-optimal results for the stingray and the bent plane. For the 5-dimensional cylinder, the dimension was underestimated (mean 4.6). Methods based on correlation dimension are known to underestimate the true d when the sample size is not sufficiently large [25]. In these cases, the method of [26] can be applied a posteriori to improve results.
+
+For the MLE method, using large values of k tended to improve results, but only when dimension was constant (as in the bent plane and cylinder datasets). For the stingray, however, no value of k gave correct results: small values of k increased the dimension estimates due to a bias, and large values tended to produce a uniform value throughout (thus giving better estimates only when d is constant). We found that computing the neighborhood averages using the correction of MacKay and Ghahramani [75], i.e., averaging the inverse of the estimators to reduce bias when k is small, gave slightly better results. (We did not use the final smoothing procedure which involves choosing two additional neighborhood size parameters, k1 and k2.) Finally, we confirmed these observations by testing two additional datasets with non-uniform dimensionality (Figure 35). Again, while our algorithm achieved good results locally, there was no single value of k that allowed MLE to find appropriate local estimates everywhere. 
+
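+The correction of MacKay and Ghahramani mentioned above amounts to replacing an arithmetic mean of pointwise estimates by the inverse of their mean inverse, i.e., a harmonic mean. A rough sketch follows, in which $\hat{m}_k(x_j)$ stands for the pointwise MLE estimate at $x_j$ (equation 41) and $S_i$ for whichever set of points is averaged around $x_i$; both symbols are chosen here for illustration and are not taken from the paper:
+
+```latex
+% \hat{m}_k(x_j): pointwise MLE estimate (illustrative symbol)
+% S_i: set of points averaged around x_i (illustrative symbol)
+\begin{aligned}
+\text{plain average:} \quad
+  \bar{m}_k(i) &= \frac{1}{|S_i|} \sum_{j \in S_i} \hat{m}_k(x_j), \\
+\text{inverse average:} \quad
+  \bar{m}_k^{\,\mathrm{inv}}(i) &=
+  \left( \frac{1}{|S_i|} \sum_{j \in S_i} \frac{1}{\hat{m}_k(x_j)} \right)^{-1}, \\
+\text{relation:} \quad
+  \bar{m}_k^{\,\mathrm{inv}}(i) &\le \bar{m}_k(i)
+  \quad \text{whenever all } \hat{m}_k(x_j) > 0 .
+\end{aligned}
+```
+
+Since a harmonic mean of positive values never exceeds their arithmetic mean, averaging inverses pulls the smoothed estimate downward, which is consistent with the upward bias reported above for small k.
+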
+# 5 Summary and Conclusions 
+
+In theory, applying the manifold assumption requires prior knowledge about the manifold: its geometry, topology, as well as how it was sampled. In practice, however, these manifold properties are rarely known. Instead, one typically imposes an assumption about the manifold’s dimension, $d$, which in turn suggests that $k = 2^d$ nearest neighbors should suffice. This is how many (indeed, most!) of the data graphs underlying manifold inference and non-linear dimensionality reduction are built. Since it is difficult to know whether this assumption about dimension is accurate, it is common practice to test a few values of k and choose among the results.
+
+[Figure 32 panels. Top row (NCD): k = 2, 4, 8, and IAN. Bottom row (MLE): k = 4, 8, 16, 32. Color scale: dimension, 1.0 to 3.0.]
+
+Figure 32: Estimation of local intrinsic dimension on the stingray dataset. Top row shows results for our neighborhood correlation dimension (NCD) algorithm using k-NN graphs with various k and using adaptive neighborhoods from $G^\star$ (IAN). Bottom row shows results using Levina & Bickel’s MLE estimator, which was sensitive to the choice of k: using a small value grossly overestimated the dimension over the body, and a large k ignored the geometry of the tail. NCD using IAN gave the best results, estimating dimension 2 for the body and 1 for the tail, with intermediate values for the transition tail-body and the boundary.
+
+Apart from the subjective nature of this choice, there are more general problems. Manifolds may not have a fixed dimension, they may be curved or with boundary, and sampling may vary. The intrinsic dimension may vary across the data, and so should the number of neighbors. In such cases, finding a compromise k may be far from ideal.
+
+We suggest a different approach: that one should build the nearest-neighbor graph, and hence the graph-Laplacian approximation, in as data-driven a manner as possible, while being aware of the manifold properties. Our algorithm of iterated adaptive neighborhoods (IAN) starts with a conservative assumption: that nearest neighbors should have no “nearer” neighbors between them. We then alternate between a discrete and a continuous view of neighborhood graphs, and use a volumetric statistic to check for outliers. A linear program keeps the scales minimal while providing a global cover. This optimization is convex, so results are deterministic; other approaches, such as t-SNE and UMAP, are stochastic, so depend critically on the initialization. Our kernel has been applied successfully to a variety of datasets, and compared against some of the most popular algorithms available. In all cases our performance dominates. Furthermore, IAN can be incorporated directly into many embedding algorithms, including diffusion maps, Isomap, UMAP, and t-SNE, improving their results. Most of these algorithms involve several free parameters; we have none other than the robust requirement for an outlier.
+
+[Figure 33 panels. Top row (NCD): k = 4, 16, 64, and IAN. Bottom row (MLE): k = 8, 16, 32, 64. Color scale: dimension, 1.0 to 3.5.]
+
+Figure 33: Estimation of local intrinsic dimension on the bent plane dataset. As with the stingray (Figure 32), results are sensitive to the choice of k, but here a wider range of values work due to the constant dimension. For NCD, results with IAN are comparable to those using the best k-NN graph (k = 16). With MLE, larger k improved results (comparable to those using NCD).
+
+[Figure 34 panels. Histograms of counts versus local dimension. NCD: k = 8, 16, 32, and IAN (axis 2 to 5). MLE: k = 8, 16, 32 (axis 2 to 8).]
+
+Figure 34: Estimation of local intrinsic dimension for the 5-D cylinder dataset of Figure 29. With NCD, results using IAN underestimated the true dimensionality (mean 4.63), but are still better than using a k-NN graph with arbitrary k. With MLE, larger values of k gave tighter distributions centered near the correct value (mean 4.86 for k = 32).
+
+Figure 35: Dimensionality estimation for two datasets with non-uniform local $d_i$: a “tiara” (top row), where dimension varies smoothly from 2 to 1, and a “spinning top” (bottom row, middle cross-section shown), where dimension reduces from 3 to 1 as one moves from the bulky part toward the tip. Using the optimal k for MLE could not produce good results for the entire dataset (here, k = 32 for both datasets). The NCD method, in contrast, was able to correctly adapt to the local geometry by taking advantage of the data graph produced by the IAN kernel.
+
+Other popular embedding algorithms, e.g., LLE [88], approximate the tangent space over a local neighborhood around each point. Although not explored here, using $G^\star$ to automatically provide such neighborhoods is straightforward (analogous to what was done in section 4.3 to estimate the local dimensionality). Applications to clustering need to be explored. Our weighted graph has also been applied to geodesic estimation, achieving comparable results to those obtained from graph geodesics. In contrast, the graphs obtained from other similarity kernels produced less than optimal results. Our unweighted graph has found application in local dimensionality estimation. Our proposed algorithm, neighborhood correlation dimension (NCD), takes advantage of the adaptive connectivity of our graph to improve results based on correlation dimension, namely by restricting the correlation integral to an approximately tubular neighborhood around $x_i$ in $\mathcal{M}$. As a result, we obtained accurate estimates of the local dimension in datasets where it is not uniform.
+
+Several theoretical bounds are implied throughout this paper; these need to be proved. Multi-scale kernels, such as those from equations 6 and 7, are known to approximate Laplacian operators asymptotically [98, 16]. Using our application examples as evidence, we conjecture that our version also results in good approximations. In conclusion, understanding the interplay between manifold geometry, topology, and sampling lies at the heart of many data science applications. We have taken a first step to illustrate how discrete relates to continuous, how local estimates relate to global ones, and how uncertainties in data gathering relate to both. Applying data science in a way that leads to rigorous, scientifically-appropriate conclusions must take all of these into account. 
+
+# A Greedy splitting 
+
+As an alternative to the optimization from section 3.4 (which can be expensive when the number of edges in $G$ is very large, mainly due to large dimensionality), we have developed a greedy approach in which scales that “$C$-cover” each edge $e_{ij}$ are assigned in decreasing order of length, $r_{ij}$ (the Euclidean distance between $x_i$ and $x_j$ in $\mathbb{R}^n$). We call this algorithm greedy splitting.
+
+Starting with the edge $e_{ij}$ with largest $r_{ij}$, set $\sigma_i = \sigma_j = C r_{ij}$, with $C \le 1$, thereby satisfying $\sigma_i \sigma_j \ge (C r_{ij})^2$ with equality; we say $C r_{ij}$ is evenly “split” between $\sigma_i$ and $\sigma_j$. Moreover, since $r_{ij} = r_i^{FN} = r_j^{FN}$, we know the constraints $\sigma_i \le r_i^{FN}$ and $\sigma_j \le r_j^{FN}$ are also satisfied.
+
+Continue with the edge $e_{ij}$ that has the next largest length, $r_{ij}$. Here we are met with three possible cases in which a (re)assignment of scales is needed:
+
+1. If neither of the nodes have been assigned a scale yet, evenly split the scaled distance between $\sigma_i$ and $\sigma_j$, as above.
+
+2. If one of the nodes does not have a scale yet (without loss of generality, let that node be $j$), set $\sigma_j'$ to the minimum scale that ensures $\sigma_i \sigma_j' \ge (C r_{ij})^2$, i.e., $\sigma_j' = (C r_{ij})^2 / \sigma_i$;
+
+3. If both nodes have previously been assigned a scale but $e_{ij}$ is not $C$-covered by the current values of $\sigma_i$ and $\sigma_j$, then set the quotient $a = \frac{C r_{ij}}{\sqrt{\sigma_i \sigma_j}}$ and update both scales: $\sigma_i' = a \sigma_i$ and $\sigma_j' = a \sigma_j$, thereby evenly splitting the quotient between the two nodes.
+
+After cases (2) and (3), the updated scales might need to be “rebalanced” in order to meet the constraints $\sigma_i' \le r_i^{FN}$ and $\sigma_j' \le r_j^{FN}$. Without loss of generality, let $\sigma_i' > r_i^{FN}$. Then, we set
+
+$$\sigma_i'' = r_i^{FN} \quad \text{and} \quad \sigma_j'' = \frac{\sigma_j' \sigma_i'}{\sigma_i''}.$$
+
+Only one of the two scales may exceed its upper bound: in (2), this is trivially true since only the newly-assigned scale may be greater than $C r_{ij}$; in (3), since both $\sigma_i$ and $\sigma_j$ have been previously assigned, we have $\sigma_i \le r_i^{FN}$ and $\sigma_j \le r_j^{FN}$, as well as $r_{ij} \le r_i^{FN}$ and $r_{ij} \le r_j^{FN}$, so therefore it must be the case that $r_i^{FN} r_j^{FN} \ge r_{ij}^2 = \sigma_i' \sigma_j'$. Note that, as a corollary, both scales must meet their respective constraints after being re-balanced as above.
+
+The above is repeated until all edges have been visited. By covering the largest edges first, we assign the largest, most constrained scales first, allowing for the later, less constrained scales, to be as small as possible. Because in most cases this tends to evenly split the scaled edge lengths $C r_{ij}$ between $\sigma_i$ and $\sigma_j$, the algorithm produces reasonable (but usually sub-optimal) results when compared to the linear program of section 3.4.2.
+
+# References 
+
+[1] E. Aamari, J. Kim, F. Chazal, B. Michel, A. Rinaldo, and L. Wasserman. Estimating the reach of a manifold. Electronic Journal of Statistics, 13(1):1359–1399, 2019.
+
+[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
+
+[3] A. Álvarez-Meza, J. Valencia-Aguirre, G. Daza-Santacoloma, and G. Castellanos-Domínguez. Global and local choice of the number of nearest neighbors in locally linear embedding. Pattern Recognition Letters, 32(16):2171–2177, 2011.
+
+[4] N. Amenta and M. Bern. Surface reconstruction by Voronoi filtering. Discrete & Computational Geometry, 22(4):481–504, 1999.
+
+[5] N. Amenta, M. Bern, and M. Kamvysselis. A new Voronoi-based surface reconstruction algorithm. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, pages 415–421, 1998.
+
+[6] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
+
+[7] S. Arora, W. Hu, and P. K. Kothari. An analysis of the t-SNE algorithm for data visualization. In Conference On Learning Theory, pages 1455–1462. PMLR, 2018.
+
+[8] R. Banisch, E. H. Thiede, and Z. Trstanova. pydiffmap. https://github.com/DiffusionMapsAcademics/pyDiffMap, 2017.
+
+[9] E. Becht, L. McInnes, J. Healy, C.-A. Dutertre, I. W. H. Kwok, L. G. Ng, F. Ginhoux, and E. W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology, 37(1):38–44, 2019.
+
+[10] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
+
+[11] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine learning, 56(1):209–239, 2004.
+
+[12] M. Belkin, J. Sun, and Y. Wang. Discrete Laplace operator on meshed surfaces. In Proceedings of the 24th Annual Symposium on Computational Geometry, pages 278–287, 2008.
+
+[13] M. Belkin, J. Sun, and Y. Wang. Constructing Laplace operator from point clouds in $\mathbb{R}^d$. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete algorithms, pages 1031–1040. SIAM, 2009.
+
+[14] F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin. The ball-pivoting algorithm for surface reconstruction. IEEE transactions on Visualization and Computer Graphics, 5(4):349–359, 1999.
+
+[15] M. Bernstein, V. de Silva, J. C. Langford, and J. B. Tenenbaum. Graph approximations to geodesics on embedded manifolds (Technical Report), 2000.
+
+[16] T. Berry and J. Harlim. Variable bandwidth diffusion kernels. Applied and Computational Harmonic Analysis, 40(1):68–96, 2016. ISSN 1063-5203.
+
+[17] T. Berry, D. Giannakis, and J. Harlim. Nonparametric forecasting of low-dimensional dynamical systems. Physical Review E, 91(3):032915, 2015.
+
+[18] A. Block, Z. Jia, Y. Polyanskiy, and A. Rakhlin. Intrinsic dimension estimation. Journal of Machine Learning Research, 22:1–30, 2021.
+
+[19] J.-D. Boissonnat, L. J. Guibas, and S. Y. Oudot. Manifold reconstruction in arbitrary dimensions using witness complexes. Discrete & Computational Geometry, 42(1):37–70, 2009.
+
+[20] J.-D. Boissonnat, A. Lieutier, and M. Wintraecken. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. Journal of Applied and Computational Topology, 3(1):29–58, 2019.
+
+[21] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
+
+[22] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
+
+[23] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems, 7, 1994.
+
+[24] F. Camastra. Data dimensionality estimation methods: a survey. Pattern Recognition, 36(12):2945–2954, 2003.
+
+[25] F. Camastra and A. Staiano. Intrinsic dimension estimation: Advances and open problems. Information Sciences, 328:26–41, 2016.
+
+[26] F. Camastra and A. Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404–1407, 2002.
+
+[27] D. M. Chan, R. Rao, F. Huang, and J. F. Canny. t-SNE-CUDA: GPU-accelerated t-SNE and its applications to modern data. In 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD, pages 330–338. IEEE, 2018.
+
+[28] R. R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
+
+[29] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.
+
+[30] R. R. Coifman, Y. Shkolnisky, F. J. Sigworth, and A. Singer. Graph Laplacian tomography from unknown random projections. IEEE Transactions on Image Processing, 17(10):1891–1899, 2008.
+
+[31] J. A. Costa, A. Girotra, and A. O. Hero. Estimating local intrinsic dimension with k-nearest neighbor graphs. In IEEE/SP 13th Workshop on Statistical Signal Processing, pages 417–422. IEEE, 2005.
+
+[32] K. Crane, C. Weischedel, and M. Wardetzky. Geodesics in heat: A new approach to computing distance based on heat flow. ACM Transactions on Graphics, 32(5):1–11, 2013.
+
+[33] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS, pages 634–644, USA, 1999. IEEE Computer Society. ISBN 0769504094.
+
+[34] X. Desquesnes, A. Elmoataz, and O. Lézoray. Eikonal equation adaptation on weighted graphs: fast geometric diffusion process for local and non-local image and data processing. Journal of Mathematical Imaging and Vision, 46(2):238–257, 2013.
+
+[35] G. Dimitriadis, J. P. Neto, and A. R. Kampff. t-SNE visualization of large-scale neural recordings. Neural Computation, 30(7):1750–1774, 2018.
+
+[36] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
+
+[37] R. Dyer, H. Zhang, and T. Möller. Gabriel meshes and Delaunay edge flips. In 2009 SIAM/ACM Joint Conference on Geometric and Physical Modeling, pages 295–300, 2009.
+
+[38] J.-P. Eckmann and D. Ruelle. Fundamental limitations for estimating dimensions and lyapunov exponents in dynamical systems. Physica D: Nonlinear Phenomena, 56(2-3):185–187, 1992.
+
+[39] E. Facco, M. d’Errico, A. Rodriguez, and A. Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):1–8, 2017.
+
+[40] A. M. Farahmand, C. Szepesvári, and J.-Y. Audibert. Manifold-adaptive dimension estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 265–272, 2007.
+
+[41] H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.
+
+[42] C. Fefferman, S. Mitter, and H. Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
+
+[43] C. Fefferman, S. Ivanov, Y. Kurylev, M. Lassas, and H. Narayanan. Fitting a putative manifold to noisy data. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 688–720. PMLR, 2018.
+
+[44] S. Fortune. Voronoi diagrams and Delaunay triangulations. Computing in Euclidean Geometry, pages 225–265, 1995.
+
+[45] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
+
+[46] Y. Fujiwara, Y. Ida, S. Kanai, A. Kumagai, and N. Ueda. Fast similarity computation for t-SNE. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1691–1702. IEEE, 2021.
+
+[47] K. Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2):176–183, 1971.
+
+[48] K. R. Gabriel and R. R. Sokal. A new statistical approach to geographic variation analysis. Systematic Zoology, 18(3):259–278, 1969.
+
+[49] C. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13(43):1263–1291, 2012.
+
+[50] Y. Goldberg and Y. Ritov. Local procrustes for manifold embedding: a measure of embedding quality and embedding algorithms. Machine learning, 77(1):1–25, 2009.
+
+[51] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. In The theory of chaotic attractors, pages 170–189. Springer, 2004.
+
+[52] L. Haghverdi, F. Buettner, and F. J. Theis. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics, 31(18):2989–2998, 2015.
+
+[53] G. Haro, G. Randall, and G. Sapiro. Translated Poisson mixture model for stratification learning. International Journal of Computer Vision, 80(3):358–374, 2008.
+
+[54] J. He, L. Ding, L. Jiang, Z. Li, and Q. Hu. Intrinsic dimensionality estimation based on manifold assumption. Journal of Visual Communication and Image Representation, 25(5):740–747, 2014.
+
+[55] M. Hein and J.-Y. Audibert. Intrinsic dimensionality estimation of submanifolds in $\mathbb{R}^d$. In Proceedings of the 22nd international conference on Machine learning, pages 289–296, 2005.
+
+[56] M. Hein and M. Maier. Manifold denoising. Advances in Neural Information Processing Systems, 19, 2006.
+
+[57] G. E. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15, 2002.
+
+[58] B. Kégl. Intrinsic dimension estimation using packing numbers. Advances in Neural Information Processing Systems, 15, 2002.
+
+[59] Y. Keller, R. R. Coifman, S. Lafon, and S. W. Zucker. Audio-visual group recognition using diffusion maps. IEEE Transactions on Signal Processing, 58(1):403–413, 2009.
+
+[60] M. G. Kendall. Rank correlation methods. Griffin, 1948.
+
+[61] M. Kleindessner and U. Luxburg. Dimensionality estimation without distances. In Proceedings of the eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 471–479, San Diego, California, USA, 09–12 May 2015. PMLR.
+
+[62] W. R. Knight. A computer method for calculating kendall’s tau with ungrouped data. Journal of the American Statistical Association, 61(314):436–439, 1966.
+
+[63] D. Kobak and P. Berens. The art of using t-SNE for single-cell transcriptomics. Nature communications, 10(1):1–14, 2019.
+
+[64] D. Kobak and G. C. Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature biotechnology, 39(2):156–157, 2021.
+
+[65] O. Kouropteva, O. Okun, and M. Pietikäinen. Selection of the optimal parameter value for the locally linear embedding algorithm. FSKD, 2:359–363, 2002.
+
+[66] S. Lafon. Diffusion maps and geometric harmonics. PhD thesis, Yale University, 2004.
+
+[67] S. Lafon, Y. Keller, and R. R. Coifman. Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1784–1797, 2006.
+
+[68] J. Lee. Introduction to topological manifolds, volume 202. Springer Science & Business Media, 2010.
+
+[69] J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction, volume 1. Springer, 2007.
+
+[70] E. Levina and P. Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems, 17, 2004.
+
+[71] O. Lindenbaum, M. Salhov, A. Yeredor, and A. Averbuch. Gaussian bandwidth selection for manifold learning and classification. Data Mining and Knowledge Discovery, 34(6):1676–1712, 2020.
+
+[72] G. C. Linderman, M. Rachh, J. G. Hoskins, S. Steinerberger, and Y. Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16(3):243–245, 2019.
+
+[73] A. V. Little, M. Maggioni, and L. Rosasco. Multiscale geometric methods for data sets I: multiscale SVD, noise and curvature. Applied and Computational Harmonic Analysis, 43(3):504–567, 2017.
+
+[74] L. Lovász. Discrete and continuous: two sides of the same? In Visions in Mathematics, pages 359–382. Springer, 2010.
+
+[75] D. MacKay and Z. Ghahramani. Comments on ‘Maximum likelihood estimation of intrinsic dimension’ by E. Levina and P. Bickel (2005). The Inference Group Website, Cavendish Laboratory, Cambridge University, 2005. URL http://www.inference.org.uk/mackay/dimension/.
+
+[76] D. W. Matula and R. R. Sokal. Properties of Gabriel graphs relevant to geographic variation research and the clustering of points in the plane. Geographical Analysis, 12(3):205–222, 1980.
+
+[77] L. McInnes, J. Healy, N. Saul, and L. Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.
+
+[78] N. Mekuz and J. K. Tsotsos. Parameterless Isomap with adaptive neighborhood selection. In Joint Pattern Recognition Symposium, pages 364–373. Springer, 2006.
+
+[79] G. Mishne and I. Cohen. Multiscale anomaly detection using diffusion maps. IEEE Journal of selected topics in signal processing, 7(1):111–123, 2012.
+
+[80] K. R. Moon, D. van Dijk, Z. Wang, S. Gigante, D. B. Burkhardt, W. S. Chen, K. Yim, A. v. d. Elzen, M. J. Hirn, R. R. Coifman, N. Ivanova, G. Wolf, and S. Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, 2019.
+
+[81] H. Narayanan and S. Mitter. Sample complexity of testing the manifold hypothesis. Advances in Neural Information Processing Systems, 23, 2010.
+
+[82] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419–441, 2008.
+
+[83] P. Niyogi, S. Smale, and S. Weinberger. A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40(3):646–663, 2011.
+
+[84] F. Noé, R. Banisch, and C. Clementi. Commute maps: separating slowly mixing molecular configurations for kinetic modeling. Journal of Chemical Theory and Computation, 12(11):5620–5630, 2016.
+
+[85] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
+
+[86] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(1):25–37, 1979.
+
+[87] S. Roweis, L. Saul, and G. E. Hinton. Global coordination of local linear models. Advances in Neural Information Processing Systems, 14, 2001.
+
+[88] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
+
+[89] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality estimators. Machine learning, 89(1):37–65, 2012.
+
+[90] M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. In European Conference on Machine Learning, pages 371–383. Springer, 2004.
+
+[91] O. Samko, A. D. Marshall, and P. L. Rosin. Selection of the optimal parameter value for the Isomap algorithm. Pattern Recognition Letters, 27(9):968–979, 2006.
+
+[92] L. K. Saul, K. Q. Weinberger, F. Sha, J. Ham, and D. D. Lee. Spectral methods for dimensionality reduction. Semi-supervised Learning, 3, 2006.
+
+[93] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
+
+[94] D. Spielman. Spectral graph theory. Combinatorial Scientific Computing, 18, 2012.
+
+[95] J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pages 287–297, 2016.
+
+[96] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
+
+[97] C. Thäle. 50 years sets with positive reach: a survey. Surveys in Mathematics and its Applications, 3:123–165, 2008.
+
+[98] D. Ting, L. Huang, and M. I. Jordan. An analysis of the convergence of graph Laplacians. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML, pages 1079–1086, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077.
+
+[99] G. V. Trunk. Statistical estimation of the intrinsic dimensionality of a noisy signal collection. IEEE Transactions on Computers, 100(2):165–171, 1976.
+
+[100] L. van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.
+
+[101] L. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
+
+[102] L. van der Maaten, E. Postma, and J. van den Herik. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 10(66-71):13, 2009.
+
+[103] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
+
+[104] P. J. Verveer and R. P. W. Duin. An evaluation of intrinsic dimensionality estimators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):81–86, 1995.
+
+[105] X. Wan, W. Wang, J. Liu, and T. Tong. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Medical Research Methodology, 14(1):1–13, 2014.
+
+[106] J. Wang, Z. Zhang, and H. Zha. Adaptive manifold learning. Advances in Neural Information Processing Systems, 17, 2004.
+
+[107] Y. Wang, H. Huang, C. Rudin, and Y. Shaposhnik. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22(201):1–73, 2021.
+
+[108] M. Wattenberg, F. Viégas, and I. Johnson. How to use t-SNE effectively. Distill, 1(10):e2, 2016.
+
+[109] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.
+
+[110] L.-S. Young. Dimension, entropy and lyapunov exponents. Ergodic theory and dynamical systems, 2(1):109–124, 1982.
+
+[111] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS, pages 1601–1608, Cambridge, MA, USA, 2004. MIT Press.
+
+[112] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.
diff --git a/pinn/docs/evidence/wang2023_expert_guide_pinn.md b/pinn/docs/evidence/wang2023_expert_guide_pinn.md
new file mode 100644
index 0000000..69c1596
--- /dev/null
+++ b/pinn/docs/evidence/wang2023_expert_guide_pinn.md
@@ -0,0 +1,1490 @@
+Title: 2308.08468v1.pdf
+
+URL Source: https://arxiv.org/pdf/2308.08468
+
+Published Time: Sat, 19 Aug 2023 00:01:03 GMT
+
+Number of Pages: 36
+
+Markdown Content:
+# AN EXPERT’S GUIDE TO TRAINING PHYSICS-INFORMED NEURAL NETWORKS
+
+Sifan Wang 
+
+Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104 
+
+sifanw@sas.upenn.edu 
+
+Shyam Sankaran 
+
+Department of Mechanical Engineering and Applied Mechanics University of Pennsylvania Philadelphia, PA 19104 
+
+shyamss@seas.upenn.edu 
+
+Hanwen Wang 
+
+Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104 
+
+wangh19@sas.upenn.edu 
+
+Paris Perdikaris 
+
+Department of Mechanical Engineering and Applied Mechanics University of Pennsylvania Philadelphia, PA 19104 
+
+pgp@seas.upenn.edu 
+
+# ABSTRACT 
+
+Physics-informed neural networks (PINNs) have been popularized as a deep learning framework that can seamlessly synthesize observational data and partial differential equation (PDE) constraints. Their practical effectiveness however can be hampered by training pathologies, but also oftentimes by poor choices made by users who lack deep learning expertise. In this paper we present a series of best practices that can significantly improve the training efficiency and overall accuracy of PINNs. We also put forth a series of challenging benchmark problems that highlight some of the most prominent difficulties in training PINNs, and present comprehensive and fully reproducible ablation studies that demonstrate how different architecture choices and training strategies affect the test accuracy of the resulting models. We show that the methods and guiding principles put forth in this study lead to state-of-the-art results and provide strong baselines that future studies should use for comparison purposes. To this end, we also release a highly optimized library in JAX that can be used to reproduce all results reported in this paper, enable future research studies, as well as facilitate easy adaptation to new use-case scenarios. 
+
+# 1 Introduction 
+
+Recent advances in deep learning have revolutionized fields such as computer vision, natural language processing and reinforcement learning [1, 2, 3]. Powered by rapid growth in computational resources, deep neural networks are also increasingly used for modeling and simulating physical systems. Examples of these include weather forecasting [4, 5, 6], quantum chemistry [7, 8] and protein structure prediction [9]. Notably, the fusion of scientific computing and machine learning has led to the emergence of physics-informed neural networks (PINNs) [10], an emerging paradigm for tackling forward and inverse problems involving partial differential equations (PDEs). These deep learning models are known for their capability to seamlessly incorporate noisy experimental data and physical laws into the learning process. This is typically accomplished by parameterizing unknown functions of interest using deep neural networks and formulating a multi-task learning problem with the aim of matching observational data and approximating an underlying PDE system. Over the last couple of years, PINNs have led to a series of promising results across a range of problems in computational science and engineering, including fluid mechanics [11, 12, 13], bio-engineering [14, 15], materials [16, 17, 18], molecular dynamics [19], electromagnetics [20, 21], geosciences [22, 23], and the design of thermal systems [24, 25].
+
+> arXiv:2308.08468v1 [cs.LG] 16 Aug 2023
+
+Despite some empirical success, PINNs are still facing many challenges that define open areas for research and further methodological advancements. In recent years, there have been numerous studies focusing on improving the performance of PINNs, mostly by designing more effective neural network architectures or better training algorithms. For example, loss re-weighting schemes have emerged as a prominent strategy for promoting a more balanced training process and improved test accuracy [26, 27, 28, 29]. Other efforts aim to achieve similar goals by adaptively re-sampling collocation points, such as importance sampling [30], evolutionary sampling [31] and residual-based adaptive sampling [32]. Considerable efforts have also been dedicated towards developing new neural network architectures to improve the representation capacity of PINNs. Examples include the use of adaptive activation functions [33], positional embeddings [34, 35], and novel architectures [26, 36, 37, 38, 39, 40]. Another research avenue explores alternative objective functions for PINNs training, beyond the weighted summation of residuals [41]. Some approaches incorporate numerical differentiation [42], while others draw inspiration from Finite Element Methods (FEM), adopting variational formulations [43, 44]. Other approaches propose adding additional regularization terms to accelerate training of PINNs [45, 46]. Lastly, the evolution of training strategies has been an area of active research. Techniques such as sequential training [47, 48] and transfer learning [49, 50, 51] have shown potential in speeding up the learning process and yielding better predictive accuracy.
+
+While new research on PINNs is currently being produced at high frequency, a suite of common benchmarks and baselines is still missing from the literature. Almost all existing studies put forth their own collection of benchmark examples, and typically compare against the original PINNs formulation put forth by Raissi et al., which is admittedly a weak baseline. This introduces many difficulties in systematically assessing progress in the field, but also in determining how to use PINNs from a practitioner’s standpoint. To address this gap, this work proposes a training pipeline that seamlessly integrates recent research developments to effectively resolve the identified issues in PINNs training, including spectral bias [52, 35], unbalanced back-propagated gradients [26, 27] and causality violation [53]. In addition, we present a variety of techniques that could further enhance performance, shedding light on some practical tips that form a guideline for selecting hyper-parameters. This is accompanied by an extensive suite of fully reproducible ablation studies performed across a wide range of benchmarks. This allows us to identify the setups that consistently yield state-of-the-art results, which we believe should become the new baseline that future studies should compare against. We also release a high-performance library in JAX that can be used to reproduce all findings reported in this work, enable future research studies, as well as facilitate easy adaptation to new use-case scenarios. As such, we believe that this work can equally benefit researchers and practitioners to further advance PINNs and deploy them in more realistic application settings.
+
+The rest of this paper is organized as follows. In Section 2, we provide a brief overview of the original formulation of PINNs as introduced by Raissi et al. [10], and outline our training pipeline. From Section 3 to Section 5, we delve into the motivation and implementation details of the key components of the proposed algorithm. These consist of non-dimensionalization, network architectures that employ Fourier feature embeddings and random weight factorization, as well as training algorithms such as causal training, curriculum training and loss weighting strategies. Section 6 discusses various aspects of PINNs that lead to improved stability and superior training performance. Finally, in Section 7 we validate the effectiveness and robustness of the proposed pipeline across a wide range of benchmarks and showcase state-of-the-art results.
+
+# 2 Physics-informed Neural Networks 
+
+Following the original formulation of Raissi et al., we begin with a brief overview of physics-informed neural networks (PINNs) [10] in the context of solving partial differential equations (PDEs). Generally, we consider PDEs taking the form
+
+$$u_t + \mathcal{N}[u] = 0, \quad t \in [0, T], \; x \in \Omega, \tag{2.1}$$
+
+subject to the initial and boundary conditions
+
+$$u(0, x) = g(x), \quad x \in \Omega, \tag{2.2}$$
+
+$$\mathcal{B}[u] = 0, \quad t \in [0, T], \; x \in \partial\Omega, \tag{2.3}$$
+
+where $\mathcal{N}[\cdot]$ is a linear or nonlinear differential operator, and $\mathcal{B}[\cdot]$ is a boundary operator corresponding to Dirichlet, Neumann, Robin, or periodic boundary conditions. In addition, $u$ describes the unknown latent solution that is governed by the PDE system of Equation (2.1).
+
+We proceed by representing the unknown solution $u(t, x)$ by a deep neural network $u_\theta(t, x)$, where $\theta$ denotes all tunable parameters of the network (e.g., weights and biases). This allows us to define the PDE residuals as
+
+$$\mathcal{R}_\theta(t, x) = \frac{\partial u_\theta}{\partial t}(t, x) + \mathcal{N}[u_\theta](t, x). \tag{2.4}$$
+
+Then, a physics-informed model can be trained by minimizing the following composite loss function
+
+$$\mathcal{L}(\theta) = \mathcal{L}_{ic}(\theta) + \mathcal{L}_{bc}(\theta) + \mathcal{L}_r(\theta), \tag{2.5}$$
+
+where
+
+$$\mathcal{L}_{ic}(\theta) = \frac{1}{N_{ic}} \sum_{i=1}^{N_{ic}} \left| u_\theta(0, x_{ic}^i) - g(x_{ic}^i) \right|^2, \tag{2.6}$$
+
+$$\mathcal{L}_{bc}(\theta) = \frac{1}{N_{bc}} \sum_{i=1}^{N_{bc}} \left| \mathcal{B}[u_\theta](t_{bc}^i, x_{bc}^i) \right|^2, \tag{2.7}$$
+
+$$\mathcal{L}_r(\theta) = \frac{1}{N_r} \sum_{i=1}^{N_r} \left| \mathcal{R}_\theta(t_r^i, x_r^i) \right|^2. \tag{2.8}$$
+
+Here $\{x_{ic}^i\}_{i=1}^{N_{ic}}$, $\{t_{bc}^i, x_{bc}^i\}_{i=1}^{N_{bc}}$ and $\{t_r^i, x_r^i\}_{i=1}^{N_r}$ can be the vertices of a fixed mesh or points that are randomly sampled at each iteration of a gradient descent algorithm. Notice that all required gradients with respect to input variables or network parameters $\theta$ can be efficiently computed via automatic differentiation [54].
+
+However, as demonstrated by recent work, several critical training pathologies prevent PINNs from yielding accurate and robust results. These pathologies include spectral bias [52, 35], causality violation [53], and unbalanced back-propagated gradients among different loss terms [26]. To address these issues, we propose a training pipeline that integrates key recent advancements, which we believe are indispensable for the successful implementation of PINNs. As shown in Figure 1, the pipeline consists of three main steps: PDE non-dimensionalization, choosing suitable network architectures, and employing appropriate training algorithms. Further details are provided in Algorithm 1. In the following sections, we will carefully demonstrate the motivation and necessity of each component in the proposed algorithm and validate its effectiveness via a wide range of benchmarks.
+
+Figure 1: Illustration of the proposed training pipeline. The procedure begins with the non-dimensionalization of the PDE system, ensuring that input and output variables are in a reasonable range. Subsequently, an appropriate network architecture is constructed to represent the unknown PDE solution. The use of Fourier feature embeddings and random weight factorization is highly recommended for mitigating spectral bias and accelerating convergence. The training phase of the PINN model integrates various advanced algorithms, including self-adaptive loss balancing, causal training, and curriculum training.
+
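To make the composite loss (2.5)-(2.8) concrete, the following is a minimal sketch in plain Python rather than the authors' JAX library. It assumes an illustrative 1D advection problem $u_t + c\,u_x = 0$ with periodic boundaries and initial condition $\sin(x)$, none of which comes from the paper, and it replaces automatic differentiation with central finite differences so that it runs without dependencies. The names `residual`, `composite_loss`, `exact` and `wrong` are hypothetical.

```python
import math
import random

C = 1.0  # advection speed (illustrative choice, not from the paper)

def g(x):
    """Initial condition u(0, x) = sin(x)."""
    return math.sin(x)

def residual(u, t, x, h=1e-4):
    """PDE residual u_t + C * u_x via central finite differences.

    A PINN would obtain these derivatives by automatic differentiation;
    finite differences keep this sketch dependency-free.
    """
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_x = (u(t, x + h) - u(t, x - h)) / (2 * h)
    return u_t + C * u_x

def composite_loss(u, n_ic=64, n_bc=64, n_r=256, seed=0):
    """Return (L_ic, L_bc, L_r) on randomly sampled points, cf. (2.6)-(2.8)."""
    rng = random.Random(seed)
    two_pi = 2 * math.pi
    xs_ic = [rng.uniform(0, two_pi) for _ in range(n_ic)]
    ts_bc = [rng.uniform(0, 1) for _ in range(n_bc)]
    pts_r = [(rng.uniform(0, 1), rng.uniform(0, two_pi)) for _ in range(n_r)]

    l_ic = sum((u(0.0, x) - g(x)) ** 2 for x in xs_ic) / n_ic
    # Periodic boundary operator: B[u] = u(t, 0) - u(t, 2*pi).
    l_bc = sum((u(t, 0.0) - u(t, two_pi)) ** 2 for t in ts_bc) / n_bc
    l_r = sum(residual(u, t, x) ** 2 for t, x in pts_r) / n_r
    return l_ic, l_bc, l_r

def exact(t, x):
    """True solution of the advection problem."""
    return math.sin(x - C * t)

def wrong(t, x):
    """Candidate that matches the initial condition but ignores transport."""
    return math.sin(x)

loss_exact = composite_loss(exact)   # all three terms are close to zero
loss_wrong = composite_loss(wrong)   # the residual term dominates
```

The second candidate satisfies the initial and boundary terms yet has a large residual term, which is the kind of imbalance between loss terms that the weighting schemes discussed later are meant to handle.
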
+# 3 Non-dimensionalization 
+
+It is well known that data normalization is an important pre-processing step in traditional deep learning, which typically involves scaling the input features of a data-set so that they have similar magnitudes and ranges [55, 56]. However, this process may not be generally applicable for PINNs as the target solutions are typically not available when solving forward PDE problems. In such cases, it is important to ensure that the target output variables vary within a reasonable range.
+
+One way to achieve this is through non-dimensionalization, a common technique in mathematics and physics for simplifying and analyzing complex systems by transforming the original system into an equivalent dimensionless system. This is performed by selecting one or more fundamental units or characteristic values, and scaling the variables in the problem so that they become dimensionless and of order one. From our experience, non-dimensionalization plays a crucial role in building physics-informed models, especially when dealing with experimental data or real-world problems, for the following reasons:
+
+- Lack of consistent network initialization schemes: The initialization of neural networks plays a crucial role in the effectiveness of gradient descent algorithms. Common initialization schemes (e.g., Glorot [55]) not only prevent vanishing gradients but also accelerate training convergence. A critical assumption for these initialization methods is that input variables should be in a moderate range, such as having a zero mean and unit variance, which enables smooth and stable forward and backward propagation. To satisfy this assumption, we propose using non-dimensionalization to scale the input and output variables so that they are of order one.
+- Mitigating the disparities in variable scales: If input and output variables have different scales, some can dominate over others, leading to unbalanced contributions in the model training, therefore hindering the learning of meaningful correlations between them. Non-dimensionalization, which scales variables to have similar magnitudes and ranges, can help to reduce this discrepancy and facilitate model training.
+- Improving convergence: If the variables are not properly scaled, the optimization algorithm may have to take very small steps to adjust the weights for one variable while taking large steps for another variable. This may result in a slow and unstable training process. Through non-dimensionalization, the optimizer can take more consistent steps, yielding faster convergence and better performance.
+
+While non-dimensionalization is an indispensable pre-processing step, it is not a “silver bullet” that can resolve all issues in training PINNs. One of the main differences between PINNs and conventional deep learning tasks is the minimization of PDE residuals, which introduces additional difficulties in the optimization process. Even if all variables are properly scaled via non-dimensionalization, the scale of PDE residuals can still vastly differ from the scale of the latent solution function, leading to a considerable discrepancy in the scale of different loss terms. Therefore, it is important to carefully inspect and re-scale the loss terms that define the PINNs objective. In Section 5.2, we introduce two self-adaptive loss weighting schemes based on the magnitude of back-propagated gradients and Neural Tangent Kernel (NTK) theory. We will show that these methods can automatically balance the interplay between different loss terms during training and lead to more robust model performance.
+
+Algorithm 1: Training pipeline of physics-informed neural networks
+
+1. Non-dimensionalize the PDE system (2.1).
+2. Represent the PDE solution by a multi-layer perceptron network (MLP) $u_\theta$ with Fourier feature embeddings and random weight factorization. In general, we recommend using tanh activations and initializing the network with the Glorot scheme.
+3. Formulate the weighted loss function according to the PDE system:
+
+   $$\mathcal{L}(\theta) = \lambda_{ic} \mathcal{L}_{ic}(\theta) + \lambda_{bc} \mathcal{L}_{bc}(\theta) + \lambda_r \mathcal{L}_r(\theta), \tag{2.9}$$
+
+   where $\mathcal{L}_{ic}(\theta)$ and $\mathcal{L}_{bc}(\theta)$ are defined in (2.6) and (2.7) respectively, and
+
+   $$\mathcal{L}_r(\theta) = \frac{1}{M} \sum_{i=1}^{M} w_i \mathcal{L}_r^i(\theta). \tag{2.10}$$
+
+   Here we partition the temporal domain into $M$ equal sequential segments and introduce $\mathcal{L}_r^i$ to denote the PDE residual loss within the $i$-th segment of the temporal domain.
+4. Set all global weights $\lambda_{ic}, \lambda_{bc}, \lambda_r$ and temporal weights $\{w_i\}_{i=1}^{M}$ to 1.
+5. Use $S$ steps of a gradient descent algorithm to update the parameters $\theta$. For $n = 1, \dots, S$:
+   - (a) Randomly sample $\{x_{ic}^i\}_{i=1}^{N_{ic}}$, $\{t_{bc}^i, x_{bc}^i\}_{i=1}^{N_{bc}}$ and $\{t_r^i, x_r^i\}_{i=1}^{N_r}$ in the computational domain and evaluate each loss term $\mathcal{L}_{ic}$, $\mathcal{L}_{bc}$ and $\{\mathcal{L}_r^i\}_{i=1}^{M}$.
+   - (b) Compute and update the temporal weights by
+
+     $$w_i = \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}_r^k(\theta) \right), \quad \text{for } i = 2, 3, \dots, M. \tag{2.11}$$
+
+     Here $\epsilon > 0$ is a user-defined hyper-parameter that determines the "slope" of temporal weights.
+   - If $n \bmod f = 0$:
+     - (c) Compute the global weights by
+
+       $$\hat{\lambda}_{ic} = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_r(\theta)\|}{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\|}, \tag{2.12}$$
+
+       $$\hat{\lambda}_{bc} = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_r(\theta)\|}{\|\nabla_\theta \mathcal{L}_{bc}(\theta)\|}, \tag{2.13}$$
+
+       $$\hat{\lambda}_r = \frac{\|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_r(\theta)\|}{\|\nabla_\theta \mathcal{L}_r(\theta)\|}, \tag{2.14}$$
+
+       where $\|\cdot\|$ denotes the $L^2$ norm.
+     - (d) Update the global weights $\lambda = (\lambda_{ic}, \lambda_{bc}, \lambda_r)$ using a moving average of the form
+
+       $$\lambda_{new} = \alpha \lambda_{old} + (1 - \alpha) \hat{\lambda}_{new}, \tag{2.15}$$
+
+       where the parameter $\alpha$ determines the balance between the old and new values.
+   - (e) Update the parameters $\theta$ via gradient descent
+
+     $$\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n). \tag{2.16}$$
+
+The recommended default values for hyper-parameters are as follows: $f = 1{,}000$, $\alpha = 0.9$, $\gamma = 1.0$, $\epsilon = 1.0$. Please note that we freeze the back-propagation of the weights $w_i$ and $\lambda_i$ with respect to the network parameters $\theta$.
+
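As a small worked illustration of the scaling described in this section, consider the heat equation $u_t = \kappa u_{xx}$ on $x \in [0, L]$, $t \in [0, T]$, with solution values of typical size $U$. This example and the symbols $\kappa$, $L$, $T$, $U$ are illustrative assumptions and are not taken from the paper. Substituting $x = L x^*$, $t = T t^*$ and $u = U u^*$ gives

$$\frac{U}{T}\,\frac{\partial u^*}{\partial t^*} = \kappa\,\frac{U}{L^2}\,\frac{\partial^2 u^*}{\partial x^{*2}} \quad\Longrightarrow\quad \frac{\partial u^*}{\partial t^*} = \frac{\kappa T}{L^2}\,\frac{\partial^2 u^*}{\partial x^{*2}}.$$

The network inputs $x^*, t^* \in [0, 1]$ and the output $u^*$ are then of order one, and the only remaining parameter is the dimensionless group $\kappa T / L^2$. Choosing the time scale $T = L^2 / \kappa$ would set this coefficient to one.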
+# 4 Network Architecture 
+
+In this section, we delve into the selection of suitable network architectures for training PINNs. We begin by providing a brief overview of multi-layer perceptrons, along with common hyper-parameter choices, activation functions, and initialization schemes. Then, we discuss random Fourier feature embeddings, a simple yet effective technique that enables coordinate MLPs to learn complex high frequency functions. Finally, we introduce random weight factorization, a simple drop-in replacement of dense layers that has been shown to consistently accelerate training convergence and improve model performance. 
+
+## 4.1 Multi-layer Perceptrons (MLP)
+
+We mainly use multi-layer perceptrons (MLPs) as universal approximators to represent the latent functions of interest, which take the coordinates of a spatio-temporal domain as inputs and predict the corresponding target solution functions. Specifically, let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. An MLP $f_\theta(x)$ is recursively defined by
+
+$$f^{(l)}(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma(f^{(l)}(x)), \quad l = 1, 2, \dots, L, \tag{4.1}$$
+
+with a final layer
+
+$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}, \tag{4.2}$$
+
+where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix in the $l$-th layer and $\sigma$ is an element-wise activation function. Here, $\theta = \left( W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)} \right)$ represents all trainable parameters in the network.
+
+In practice, the choice of an appropriate network architecture impacts the success of physics-informed neural networks. From our experience, networks that are too narrow and shallow lack the expressive capacity to capture complex nonlinear functions, while networks that are too wide and deep can be difficult to optimize. Therefore, we recommend employing networks with width ranging from 128 to 512 and depth ranging from 3 to 6, which tends to yield relatively optimal and robust results. To build a continuously differentiable neural representation, we recommend using the hyperbolic tangent (Tanh). Other popular choices include sinusoidal functions [36] and GeLU [57]. We point out that ReLU is not suitable since its second-order derivative is zero, which inevitably saturates the computation of PDE residuals. Finally, dense layers will typically be initialized using the Glorot scheme [55].
+
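The recursion (4.1)-(4.2) with the recommended tanh activation and Glorot initialization can be sketched in plain Python as follows. This is a dependency-free illustration and not the authors' JAX code. The helper names `glorot_uniform`, `init_mlp`, `dense` and `mlp` are hypothetical, and the uniform variant of the Glorot scheme and the zero bias initialization are assumptions.

```python
import math
import random

def glorot_uniform(rng, fan_out, fan_in):
    """Glorot (Xavier) uniform initialization for a fan_out x fan_in matrix."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def init_mlp(layer_sizes, seed=0):
    """Return a list of (W, b) pairs; layer_sizes = [d_0, d_1, ..., d_{L+1}]."""
    rng = random.Random(seed)
    return [(glorot_uniform(rng, d_out, d_in), [0.0] * d_out)
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def dense(W, b, x):
    """Affine map W x + b for list-based vectors."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(params, x):
    """Tanh hidden layers, cf. (4.1), followed by a linear output layer, cf. (4.2)."""
    for W, b in params[:-1]:
        x = [math.tanh(z) for z in dense(W, b, x)]
    W, b = params[-1]
    return dense(W, b, x)

# Inputs (t, x), three hidden layers of width 128, scalar output u.
params = init_mlp([2, 128, 128, 128, 1])
u = mlp(params, [0.5, -0.25])
```

The layer sizes sit at the lower end of the recommended range (width 128, depth 3); wider or deeper variants only change the `layer_sizes` list.
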
+## 4.2 Random Fourier features
+
+As demonstrated by [52, 58, 59], MLPs suffer from a phenomenon referred to as spectral bias, meaning that they are biased towards learning low frequency functions. This undesired preference also prevents PINNs from learning high frequencies and fine structures of target solutions [35]. In Appendix A, we present a detailed analysis of this phenomenon through the lens of Neural Tangent Kernel (NTK) theory. To mitigate spectral bias, Tancik et al. [60] proposed random Fourier feature embeddings, which map input coordinates into high frequency signals before passing them through an MLP. This encoding $\gamma : \mathbb{R}^d \to \mathbb{R}^{2m}$ is defined by
+
+$$\gamma(x) = \begin{bmatrix} \cos(Bx) \\ \sin(Bx) \end{bmatrix}, \tag{4.3}$$
+
+where each entry in $B \in \mathbb{R}^{m \times d}$ is sampled from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$ and $\sigma > 0$ is a user-specified hyper-parameter. This simple method has been shown to significantly enhance the performance of PINNs in approximating sharp gradients and complex solutions [35].
+
+It is worth emphasizing the significance of the scale factor $\sigma$ in the performance of neural networks. As demonstrated in Appendix A and [35], this hyper-parameter directly governs the frequencies of $\gamma_i$ and the resulting eigenspace of the NTK, thereby biasing the network to learn certain band-limited signals. Specifically, lower encoding frequencies can result in blurry predictions, while higher encoding frequencies can introduce salt-and-pepper artifacts. Ideally, an appropriate $\sigma$ should be selected such that the bandwidth of the NTK matches that of the target signals. This not only accelerates the training convergence, but also improves the prediction accuracy. However, the spectral information of the solution may not be accessible when solving forward PDEs. In practice, we recommend using a moderately large $\sigma \in [1, 10]$.
+
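The embedding (4.3) can be sketched in a few lines of plain Python. This is an illustration under assumed sizes ($m = 64$ frequencies, $d = 2$ inputs, $\sigma = 2$), with hypothetical helper names `sample_fourier_matrix` and `fourier_features`. The matrix $B$ is sampled once and kept fixed, and the output has $2m$ entries because the cosine and sine parts are concatenated.

```python
import math
import random

def sample_fourier_matrix(m, d, sigma, seed=0):
    """Sample B in R^{m x d} with i.i.d. N(0, sigma^2) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]

def fourier_features(B, x):
    """gamma(x) = [cos(Bx), sin(Bx)], mapping R^d to R^{2m}."""
    Bx = [sum(b * xj for b, xj in zip(row, x)) for row in B]
    return [math.cos(z) for z in Bx] + [math.sin(z) for z in Bx]

B = sample_fourier_matrix(m=64, d=2, sigma=2.0)  # sigma picked from [1, 10]
features = fourier_features(B, [0.3, 0.7])       # fed to the MLP instead of x
```

A larger `sigma` spreads the sampled frequencies over a wider band, which is the knob the text above describes for trading blurry predictions against high-frequency artifacts.
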
+## 4.3 Random weight factorization
+
+Recently, Wang et al. [61] proposed random weight factorization (RWF) and demonstrated that it can consistently improve the performance of PINNs. RWF factorizes the weights associated with each neuron in the network as
+
+$$w^{(k,l)} = s^{(k,l)} \cdot v^{(k,l)}, \tag{4.4}$$
+
+for $k = 1, 2, \dots, d_l$ and $l = 1, 2, \dots, L + 1$, where $w^{(k,l)} \in \mathbb{R}^{d_{l-1}}$ is a weight vector representing the $k$-th row of the weight matrix $W^{(l)}$, $s^{(k,l)} \in \mathbb{R}$ is a trainable scale factor assigned to each individual neuron, and $v^{(k,l)} \in \mathbb{R}^{d_{l-1}}$. Consequently, the proposed weight factorization can be written as
+
+$$W^{(l)} = \mathrm{diag}(s^{(l)}) \cdot V^{(l)}, \quad l = 1, 2, \dots, L + 1, \tag{4.5}$$
+
+with $s^{(l)} \in \mathbb{R}^{d_l}$. We provide a geometric intuition of weight factorization in Appendix B. More theoretical and experimental results can be found in Appendix B and [61].
+
+In practice, RWF is applied as follows. We first initialize the parameters of an MLP using a standard scheme, e.g. the Glorot scheme [55]. Then, for every weight matrix $W$, we proceed by initializing a scale vector $\exp(s)$, where $s$ is sampled from a multivariate normal distribution $\mathcal{N}(\mu, \sigma I)$. Then every weight matrix is factorized by the associated scale factor as $W = \mathrm{diag}(\exp(s)) \cdot V$ at initialization. Finally, we apply gradient descent to the new parameters $s, V$ directly. This procedure is summarized in Algorithm 2. The use of exponential parameterization is motivated by Weight Normalization [62] to strictly avoid zeros or very small values in the scale factors and allow them to span a wide range of different magnitudes. Empirically, too small $\mu, \sigma$ values may lead to performance that is similar to a conventional MLP, while too large $\mu, \sigma$ can result in an unstable training process. Therefore, we recommend setting $\mu = 0.5$ or $1$, and $\sigma = 0.1$, which seem to consistently and robustly improve the loss convergence and model accuracy.
+
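The factorization at initialization can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the helper names `factorize` and `reconstruct` are hypothetical, $\sigma$ is treated as a standard deviation (an assumption about the notation $\mathcal{N}(\mu, \sigma I)$), and $V$ is obtained by dividing each row of $W$ by its scale so that $W = \mathrm{diag}(\exp(s)) \cdot V$ holds exactly before training starts.

```python
import math
import random

def factorize(W, mu=1.0, sigma=0.1, seed=0):
    """Split W into (s, V) such that W = diag(exp(s)) @ V at initialization."""
    rng = random.Random(seed)
    s = [rng.gauss(mu, sigma) for _ in W]                  # one scale per neuron (row)
    V = [[w / math.exp(si) for w in row] for row, si in zip(W, s)]
    return s, V

def reconstruct(s, V):
    """Rebuild W = diag(exp(s)) @ V; s and V are the trainable parameters."""
    return [[math.exp(si) * v for v in row] for row, si in zip(V, s)]

W = [[0.2, -0.1, 0.4],
     [0.0, 0.3, -0.5]]   # stand-in for a Glorot-initialized weight matrix
s, V = factorize(W)
W_rebuilt = reconstruct(s, V)
```

The network output is unchanged at initialization; the difference only appears during training, because gradient descent now acts on `s` and `V` instead of on `W`.
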
+Algorithm 2: Random weight factorization (RWF)
+
+1. Initialize a neural network $f_\theta$ with $\theta = \{W^{(l)}, b^{(l)}\}_{l=1}^{L+1}$ (e.g. using the Glorot scheme [55]). For $l = 1, 2, \dots, L$:
+   - (a) Initialize each scale factor as $s^{(l)} \sim \mathcal{N}(\mu, \sigma I)$.
+   - (b) Construct the factorized weight matrices as $W^{(l)} = \mathrm{diag}(\exp(s^{(l)})) \cdot V^{(l)}$.
+2. Train the network by gradient descent on the factorized parameters $\{s^{(l)}, V^{(l)}, b^{(l)}\}_{l=1}^{L+1}$.
+
+The recommended hyper-parameters are $\mu = 1.0$, $\sigma = 0.1$.
+
+# 5 Training
+
+## 5.1 Respecting Temporal Causality
+
+In this section, we discuss the motivation and details of equation (2.11) in Algorithm 1. Recently, Wang et al. [53] illustrated that PINNs may violate temporal causality when solving time-dependent PDEs, and hence are susceptible to converging towards erroneous solutions. This is mainly because conventional PINNs tend to minimize all PDE residuals simultaneously, while being undesirably biased toward minimizing PDE residuals at later times, even before obtaining the correct solutions for earlier times. A more detailed analysis can be found in Appendix C and [53].
+
+To impose the missing causal structure within the optimization process, we first split the temporal domain into $M$ equal sequential segments and introduce $\mathcal{L}_r^i$ to denote the PDE residual loss within the $i$-th segment of the temporal domain. Then the original PDE residual loss can be rewritten as
+
+$$\mathcal{L}_r(\theta) = \frac{1}{M} \sum_{i=1}^{M} w_i \mathcal{L}_r^i(\theta). \tag{5.1}$$
+
+Combining with equation (2.11), we obtain
+
+$$\mathcal{L}_r(\theta) = \frac{1}{M} \sum_{i=1}^{M} \exp\left( -\epsilon \sum_{k=1}^{i-1} \mathcal{L}_r^k(\theta) \right) \mathcal{L}_r^i(\theta). \tag{5.2}$$
+
+It can be observed that $w_i$ is inversely exponentially proportional to the magnitude of the cumulative residual loss from the previous time steps. As a result, $\mathcal{L}_r^i(\theta)$ will not be minimized unless all previous residuals $\{\mathcal{L}_r^k(\theta)\}_{k=1}^{i-1}$ decrease to a sufficiently small value such that $w_i$ is large enough. These temporal weights encourage PINNs to learn the PDE solution progressively along the time axis, in accordance with how the information propagates in time, as the dynamics evolve throughout the spatio-temporal domain.
+
+We emphasize that the computational cost of calculating temporal weights is negligible, as the temporal weights are computed by directly evaluating the PINNs loss functions, whose values are already stored in the computational graph during training. Moreover, it is important to note that the temporal weights are functions of the trainable parameters $\theta$. In our JAX implementation [63], we use lax.stop_gradient to avoid the computation of back-propagated gradients of temporal weights, thereby further conserving computational resources.
+
+We must point out that the proposed weighted residual loss does exhibit some sensitivity to the causality parameter $\epsilon$. Choosing a very small $\epsilon$ may fail to impose enough temporal causality, resulting in the PINN model behaving similarly to the conventional one. Conversely, choosing a large $\epsilon$ value can result in a more difficult optimization problem, because the temporal residuals at earlier times have to decrease to a very small value in order to activate the later temporal weights. Achieving this may be difficult in some cases due to limited network capacity for minimizing the target residuals. In practice, we recommend choosing a moderately large $\epsilon$ to ensure that all temporal weights can properly converge to 1 at the end of training. If this is not the case, it would be advisable to slightly reduce the value of $\epsilon$.
+
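The temporal weights of equation (2.11) and the weighted residual loss (5.2) can be sketched in plain Python as follows. The function names `causal_weights` and `causal_residual_loss` and the toy loss values are hypothetical. Because this sketch has no automatic differentiation, the gradient-freezing step (the source mentions `lax.stop_gradient` in the JAX implementation) appears only as a comment.

```python
import math

def causal_weights(segment_losses, eps=1.0):
    """w_i = exp(-eps * sum_{k<i} L_r^k); w_1 = 1 because the sum is empty."""
    weights, cumulative = [], 0.0
    for loss in segment_losses:
        weights.append(math.exp(-eps * cumulative))
        cumulative += loss
    return weights

def causal_residual_loss(segment_losses, eps=1.0):
    """Weighted residual loss (1/M) * sum_i w_i * L_r^i, cf. (5.2).

    In an autodiff framework the weights should be treated as constants,
    so that no gradient flows through them.
    """
    w = causal_weights(segment_losses, eps)
    return sum(wi * li for wi, li in zip(w, segment_losses)) / len(segment_losses)

early_training = causal_weights([2.0, 1.0, 3.0])     # later segments are masked
late_training = causal_weights([1e-4, 1e-4, 1e-4])   # all weights approach 1
```

Monitoring the smallest weight during training is one simple way to apply the advice above: if it never approaches 1, the chosen `eps` is probably too large for the network to satisfy.
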
+## 5.2 Loss Balancing
+
+As mentioned in Section 3, one of the main challenges in training PINNs is addressing multi-scale losses that arise from the minimization of PDE residuals. These losses cannot be normalized during the pre-processing step. An alternative approach involves assigning appropriate weights to each loss term to scale them during training. However, manually choosing weights is impractical, as the optimal weights can vary greatly across different problems, making it difficult to find a fixed empirical recipe that is transferable to various PDEs. More importantly, since the solution to a PDE is unknown, there is no validation data-set available for fine-tuning these hyper-parameters in the context of solving PDEs.
+
+Given that, our training pipeline integrates a self-adaptive learning rate annealing algorithm, which can automatically balance losses during training. Specifically, we first compute $\hat{\lambda}$ by equations (2.12)-(2.14). Then we obtain
+
+$$\|\hat{\lambda}_{ic} \nabla_\theta \mathcal{L}_{ic}(\theta)\| = \|\hat{\lambda}_{bc} \nabla_\theta \mathcal{L}_{bc}(\theta)\| = \|\hat{\lambda}_r \nabla_\theta \mathcal{L}_r(\theta)\| = \|\nabla_\theta \mathcal{L}_{ic}(\theta)\| + \|\nabla_\theta \mathcal{L}_{bc}(\theta)\| + \|\nabla_\theta \mathcal{L}_r(\theta)\|. \tag{5.3}$$
+
+This effectively guarantees that the gradient norms of the weighted losses are equal to each other, preventing our model from being biased towards minimizing certain terms during training. The actual weights are then updated as a running average of their previous values, as defined by Equation (2.15). This technique mitigates the instability of stochastic gradient descent. In practice, these updates can either take place every hundred or thousand iterations of the gradient descent loop or at a user-specified frequency. Consequently, the extra computational overhead associated with these updates is negligible, particularly when updates are infrequent.
+
+We now introduce an alternative weighting scheme that leverages the resulting NTK matrix of PINNs [27]. To this end, we define the NTK matrices corresponding to $\mathcal{L}_{ic}$, $\mathcal{L}_{bc}$, and $\mathcal{L}_r$ as follows:
+
+$$K_{ic} = \left[ \left\langle \frac{\partial u}{\partial \theta}(0, x_{ic}^i), \frac{\partial u}{\partial \theta}(0, x_{ic}^j) \right\rangle \right], \tag{5.4}$$
+
+$$K_{bc} = \left[ \left\langle \frac{\partial u}{\partial \theta}(t_{bc}^i, x_{bc}^i), \frac{\partial u}{\partial \theta}(t_{bc}^j, x_{bc}^j) \right\rangle \right], \tag{5.5}$$
+
+$$K_r = \left[ \left\langle \frac{\partial \mathcal{R}[u]}{\partial \theta}(t_r^i, x_r^i), \frac{\partial \mathcal{R}[u]}{\partial \theta}(t_r^j, x_r^j) \right\rangle \right], \tag{5.6}$$
+
+where $\mathcal{R}[\cdot]$ denotes the residual operator defined in (2.4). With this definition, we can establish an NTK-based weighting scheme as shown below
+
+$$\hat{\lambda}_{ic} = \frac{\mathrm{Tr}(K_{ic}) + \mathrm{Tr}(K_{bc}) + \mathrm{Tr}(K_r)}{\mathrm{Tr}(K_{ic})}, \tag{5.7}$$
+
+$$\hat{\lambda}_{bc} = \frac{\mathrm{Tr}(K_{ic}) + \mathrm{Tr}(K_{bc}) + \mathrm{Tr}(K_r)}{\mathrm{Tr}(K_{bc})}, \tag{5.8}$$
+
+$$\hat{\lambda}_r = \frac{\mathrm{Tr}(K_{ic}) + \mathrm{Tr}(K_{bc}) + \mathrm{Tr}(K_r)}{\mathrm{Tr}(K_r)}. \tag{5.9}$$
+
+We proceed by updating the $\lambda_i$ values using a moving average, as previously described. As detailed in Appendix A, the eigenvalues of the NTK characterize the convergence rate of a loss function. Higher eigenvalues imply a faster convergence rate. Given that the trace of an NTK matrix is equal to the sum of all its eigenvalues, this scheme aims to balance the convergence rates of different loss terms such that they are comparable to one another. In practice, it should be noted that we avoid constructing the full NTK matrix. Instead, we evaluate only the diagonal elements of the NTK matrix for computing the weights, which significantly saves computational resources.
+
+We observed that while the performance of the gradient-based and NTK-based weighting schemes is similar, the updated weights in the gradient-based scheme are less stable compared to the NTK-based scheme. This instability may be attributed to the noisy back-propagated gradients due to random mini-batches. However, the NTK-based scheme demands a higher computational cost, making it more difficult to scale to complex problems. As a result, we generally recommend employing the gradient-based scheme as a first choice.
+
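The gradient-norm weighting of equations (2.12)-(2.14) and the moving-average update (2.15) can be sketched in plain Python as follows. The gradients are made-up flattened vectors standing in for $\nabla_\theta \mathcal{L}_{ic}$, $\nabla_\theta \mathcal{L}_{bc}$ and $\nabla_\theta \mathcal{L}_r$, and the helper names `l2_norm`, `grad_norm_weights` and `update_weights` are hypothetical.

```python
import math

def l2_norm(grad):
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(g * g for g in grad))

def grad_norm_weights(grads):
    """lambda_hat_k = (sum_j ||grad L_j||) / ||grad L_k||, cf. (2.12)-(2.14).

    Assumes every gradient norm is non-zero; a real implementation would
    guard against division by zero.
    """
    norms = {name: l2_norm(g) for name, g in grads.items()}
    total = sum(norms.values())
    return {name: total / n for name, n in norms.items()}

def update_weights(old, new, alpha=0.9):
    """Moving average lambda <- alpha * lambda_old + (1 - alpha) * lambda_hat, cf. (2.15)."""
    return {k: alpha * old[k] + (1.0 - alpha) * new[k] for k in old}

# Toy flattened gradients of each loss term w.r.t. theta (made-up numbers).
grads = {"ic": [3.0, 4.0], "bc": [0.3, 0.4], "r": [30.0, 40.0]}
lam_hat = grad_norm_weights(grads)
lam = update_weights({"ic": 1.0, "bc": 1.0, "r": 1.0}, lam_hat)
```

With these numbers the three gradient norms are 5, 0.5 and 50, so each weighted gradient norm equals their sum, 55.5, as stated in (5.3). The loss term with the smallest gradient receives the largest weight.
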
+## 5.3 Curriculum Training
+
+While the techniques detailed in the preceding sections have greatly enhanced the performance and application range of PINNs, there remain certain complex domains where PINNs encounter challenges, especially in scenarios where high predictive accuracy is required. For example, when simulating chaotic dynamical systems such as the Navier-Stokes equations at high Reynolds numbers, enhanced accuracy is required to prevent error accumulation and trajectory divergence. In this section, we aim to shed light on these challenging areas and explore pathways to overcome them. A promising approach we will delve into is the curriculum training strategy introduced by Krishnapriyan et. al. [48 ]. The core idea involves decomposing the entire optimization task for PINNs into a sequence of more manageable sub-tasks. In this work, we mainly focus on integrating this strategy into our training pipeline for solving time-dependent PDEs and singular perturbation problems. For time-dependent PDEs, we divide the temporal domain into smaller intervals and employ Algorithm 2 to train PINNs for solving the PDE within each of these intervals. Except for the first time window, the initial condition for each subsequent time window is updated using the prediction from the last time-step of the previous time window. This approach can be viewed as a temporal domain decomposition strategy, and significantly reduces the optimization difficulty of learning the full evolution of a dynamical system while increasing computational costs due to model retraining for each window. It is worth noting that we also partition the temporal domain in Algorithm 2 to compute the causal weights within the time-window. We emphasize that the causal weighting shares a similar motivation with “time-marching”, in the sense of respecting temporal causality by learning the solution sequentially along the time axis. 
Nevertheless, the causal weighting discussed in section 5.1 should not be considered a replacement for time-marching approaches, but rather a crucial enhancement, as violations of causality may still occur within each time window of a time-marching algorithm. In addressing singular perturbation problems, our strategy hinges on a progressive approach. We initiate the training process with a less singular PDE and progressively increase its singularity throughout the training. For example, if our goal is to solve the Navier-Stokes equation at moderately high Reynolds numbers, we typically start by training a model for a lower Reynolds number and use this result as a suitable initialization for minimizing PDE residuals at higher Reynolds numbers. Through our experiments, we have observed that this approach effectively stabilizes the training process. It reduces the likelihood of PINNs becoming trapped in unfavorable local minima, thus enabling them to accurately capture complex and nonlinear PDE solutions. For a more concrete illustration, readers are directed to the example of lid-driven cavity flow in section 7.5.
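+
The window-by-window hand-off used in time-marching can be sketched as follows. This is a toy under stated assumptions, not the paper's Algorithm 2: the function `solve_window` is a hypothetical stand-in that integrates a scalar ODE with RK4 where a PINN would be trained on the window, and the test problem du/dt = -u is chosen only because its exact solution is known.

```python
import math


def solve_window(u0, t0, t1, rhs, n_steps=100):
    """Stand-in for training one model on [t0, t1]; returns the state at t1 (RK4)."""
    h = (t1 - t0) / n_steps
    u, t = u0, t0
    for _ in range(n_steps):
        k1 = rhs(t, u)
        k2 = rhs(t + h / 2, u + h * k1 / 2)
        k3 = rhs(t + h / 2, u + h * k2 / 2)
        k4 = rhs(t + h, u + h * k3)
        u += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return u


def time_march(u0, t_final, n_windows, rhs):
    """Solve window by window; each end state becomes the next initial condition."""
    edges = [t_final * i / n_windows for i in range(n_windows + 1)]
    states = [u0]
    for t0, t1 in zip(edges[:-1], edges[1:]):
        states.append(solve_window(states[-1], t0, t1, rhs))
    return edges, states


edges, states = time_march(1.0, 2.0, 4, lambda t, u: -u)
exact = math.exp(-2.0)  # exact solution of du/dt = -u at t = 2
```

With a PINN in place of `solve_window`, each call would retrain (or warm-start) a network, which is where the extra cost mentioned above comes from.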
+
+# 6 Miscellaneous 
+
+In this section, we introduce several aspects that researchers and practitioners should consider when using PINNs to promote robust and optimal performance. The discussion highlights the importance of selecting appropriate optimizers and learning rates, imposing exact boundary conditions, employing random sampling and a modified MLP architecture. 
+
+6.1 Optimizer and learning rate 
+
+Numerous optimizers have been developed for deep learning applications; however, we find that the Adam optimizer consistently yields good performance without heavy tuning. Moreover, we do not recommend using weight decay, especially for forward problems, as it tends to degrade the resulting predictive accuracy. Furthermore, the learning rate is a crucial factor in PINNs’ performance. Our experience suggests that an initial learning rate of 0.001, combined with exponential decay, typically yields good results.
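+
A minimal sketch of such a schedule is given below. The initial rate 0.001 is from the text above, and the decay rate 0.9 per 2,000 steps is the setting quoted for the ablation studies in section 7. Whether the decay is applied continuously or in a staircase is not stated, so the `staircase` flag and the function name `exponential_decay_lr` are assumptions.

```python
def exponential_decay_lr(step, init_lr=1e-3, decay_rate=0.9,
                         decay_steps=2000, staircase=False):
    """Learning rate after `step` optimizer updates."""
    # Staircase: decay in discrete jumps; otherwise decay smoothly.
    exponent = step // decay_steps if staircase else step / decay_steps
    return init_lr * decay_rate ** exponent


lr_start = exponential_decay_lr(0)
lr_2k = exponential_decay_lr(2000)
lr_3k_stair = exponential_decay_lr(3000, staircase=True)
```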
+
+6.2 Random sampling 
+
+The choice of an appropriate sampling strategy for collocation points plays an important role in the training efficiency and model performance. In comparison to full batch sampling, random sampling significantly reduces the memory requirements and the computational cost of each iteration. More importantly, random sampling introduces regularization effects, which ultimately contribute to the improved generalization capabilities of PINNs. Based on our observations, training PINNs using full-batch gradient descent may result in over-fitting the PDE residuals. Consequently, we strongly recommend using random sampling in all PINN simulations to achieve optimal performance. 
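+
The difference from full-batch training is simply that a fresh set of collocation points is drawn at every iteration. A minimal sketch, assuming uniform sampling over a rectangular (t, x) domain; the helper name `sample_collocation` is hypothetical, and the batch size 4096 and the domain bounds are borrowed from the settings in section 7:

```python
import random


def sample_collocation(batch_size, bounds, rng):
    """Draw points uniformly from the box given by (low, high) pairs."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
            for _ in range(batch_size)]


rng = random.Random(0)
bounds = [(0.0, 1.0), (-1.0, 1.0)]  # t in [0, 1], x in [-1, 1]

batches = []
for step in range(3):
    batch = sample_collocation(4096, bounds, rng)
    batches.append(batch)
    # A training step evaluating the PDE residual on `batch` would go here.
```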
+
+6.3 Imposing boundary conditions 
+
+Recent work by Dong et al. [64] showed how to strictly impose periodic boundary conditions in PINNs as hard constraints, which not only effectively reduces the number of loss constraints but also significantly enhances training convergence and predictive accuracy. To illustrate the main idea, let us consider enforcing periodic boundary conditions with period P in a one-dimensional setting. Our goal is to build a network architecture satisfying
+
+$$u^{(l)}(a) = u^{(l)}(a + P), \quad l = 0, 1, 2, \ldots. \tag{6.1}$$
+
+To this end, we construct a special Fourier feature embedding of the form
+
+$$v(x) = (\cos(\omega x), \sin(\omega x)), \tag{6.2}$$
+
+with $\omega = \frac{2\pi}{P}$. Then, for any network representation $u_\theta$, it can be proved that any $u_\theta(v(x))$ exactly satisfies the periodic boundary condition. The same idea can be directly extended to higher-dimensional domains. For instance, let (x, y) denote the coordinates of a point in two dimensions, and suppose that u(x, y) is a smooth periodic function to be approximated in a periodic cell $[a, a + P_x] \times [b, b + P_y]$, satisfying the following constraints
+
+$$\frac{\partial^l}{\partial x^l} u(a, y) = \frac{\partial^l}{\partial x^l} u(a + P_x, y), \quad y \in [b, b + P_y], \tag{6.3}$$
+
+$$\frac{\partial^l}{\partial y^l} u(x, b) = \frac{\partial^l}{\partial y^l} u(x, b + P_y), \quad x \in [a, a + P_x], \tag{6.4}$$
+
+for l = 0, 1, 2, ..., where $P_x$ and $P_y$ are the periods in the x and y directions, respectively. Similar to the one-dimensional setting, these constraints can be implicitly encoded in a neural network by constructing a two-dimensional Fourier features embedding as
+
+$$v(x, y) = [\cos(\omega_x x), \sin(\omega_x x), \cos(\omega_y y), \sin(\omega_y y)] \tag{6.5}$$
+
+with $\omega_x = \frac{2\pi}{P_x}$ and $\omega_y = \frac{2\pi}{P_y}$.
+
+For time-dependent problems, we simply concatenate the time coordinates t with the constructed Fourier features embedding, i.e., $u_\theta([t, v(x)])$ or $u_\theta([t, v(x, y)])$. In particular, if the PDE solutions are known to exhibit periodic behavior over time, we can also enforce periodicity along the time axis. More precisely, we consider the following special Fourier embedding
+
+$$w(t, x) = [\cos(\omega_t t), \sin(\omega_t t), v(x)] \tag{6.6}$$
+
+where $\omega_t = \frac{2\pi}{P_t}$. The key difference is that $P_t$ is a trainable parameter. Typically, $P_t$ is initialized to the temporal domain’s length, allowing networks to learn the solution’s correct period. It is worth emphasizing that this assumption of time periodicity is not a severe restriction, and this technique can be applied to arbitrary dynamical systems, even if the solution is not periodic. This is because one can always set the initial $P_t$ greater than the length of the temporal domain. Lastly, other types of boundary conditions, including Dirichlet, Neumann, Robin, etc., can also be enforced in a “hard” manner by modifying the network outputs, see [65, 66] for more details.
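+
The hard periodic constraint can be checked numerically with a few lines. In this sketch `periodic_embedding` implements the one-dimensional embedding (6.2) with angular frequency 2π divided by the period, while `toy_network` is an arbitrary hypothetical function standing in for the network; any function of the embedding inherits the periodicity.

```python
import math


def periodic_embedding(x, period):
    """One-dimensional Fourier embedding (cos(omega x), sin(omega x))."""
    omega = 2.0 * math.pi / period
    return (math.cos(omega * x), math.sin(omega * x))


def toy_network(features):
    """Arbitrary smooth function of the embedding; stands in for the network."""
    c, s = features
    return math.tanh(0.7 * c - 1.3 * s) + 0.5 * c * s


def u(x, period=2.0):
    return toy_network(periodic_embedding(x, period))


# Domain [-1, 1] has period P = 2, so u(-1) and u(1) must agree.
gap_boundary = abs(u(-1.0) - u(1.0))
gap_shift = abs(u(0.3) - u(0.3 + 2.0))
```

Because the constraint holds by construction, no boundary loss term is needed for the periodic directions.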
+
+6.4 Modified MLP 
+
+In practice, we found that a simple modification of MLPs proposed by Wang et al. [26] demonstrates an enhanced capability for learning nonlinear and complex PDE solutions. The forward pass of an L-layer modified MLP is defined as follows. First, we introduce two encoders for the input coordinates
+
+$$U = \sigma(W_1 x + b_1), \quad V = \sigma(W_2 x + b_2). \tag{6.7}$$
+
+Then, for l = 1, 2, ..., L,
+
+$$f^{(l)}(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \tag{6.8}$$
+
+$$g^{(l)}(x) = \sigma(f^{(l)}(x)) \odot U + (1 - \sigma(f^{(l)}(x))) \odot V. \tag{6.9}$$
+
+The final network output is given by
+
+$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}. \tag{6.10}$$
+
+Here, σ denotes a nonlinear activation function, and ⊙ denotes an element-wise multiplication. All trainable parameters are given by
+
+$$\theta = \{W_1, b_1, W_2, b_2, (W^{(l)}, b^{(l)})_{l=1}^{L+1}\}. \tag{6.11}$$
+
+This architecture is almost the same as a standard MLP network, with the addition of two encoders and a minor modification in the forward pass. Specifically, the inputs x are embedded into a feature space via two encoders U, V, respectively, and merged in each hidden layer of a standard MLP using a point-wise multiplication. In our experience, the modified MLP demands greater computational resources; however, it generally outperforms the standard MLP in effectively minimizing PDE residuals, thereby yielding more accurate results.
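+
The forward pass in (6.7)-(6.10) can be written out in a few lines. The sketch below is a plain-Python illustration rather than the authors' implementation: the names `affine`, `init_layer`, `init_modified_mlp` and `modified_mlp` are hypothetical, the Glorot-style initialisation with zero biases is an assumption, and the recursion is started from the raw input, which the text above does not state explicitly.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    """Glorot-style uniform weights and zero bias (assumed initialisation)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    W = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def init_modified_mlp(n_in, width, n_layers, n_out, seed=0):
    rng = random.Random(seed)
    sizes = [n_in] + [width] * n_layers
    return {
        "enc_u": init_layer(width, n_in, rng),  # W_1, b_1
        "enc_v": init_layer(width, n_in, rng),  # W_2, b_2
        "hidden": [init_layer(sizes[i + 1], sizes[i], rng)
                   for i in range(n_layers)],
        "out": init_layer(n_out, width, rng),   # W^(L+1), b^(L+1)
    }


def modified_mlp(params, x, act=math.tanh):
    U = [act(z) for z in affine(*params["enc_u"], x)]  # encoders, (6.7)
    V = [act(z) for z in affine(*params["enc_v"], x)]
    g = x                                              # assumed starting value
    for W, b in params["hidden"]:
        gate = [act(z) for z in affine(W, b, g)]       # sigma(f^(l)), (6.8)
        g = [s * u + (1.0 - s) * v                     # merge step, (6.9)
             for s, u, v in zip(gate, U, V)]
    return affine(*params["out"], g)                   # output layer, (6.10)


params = init_modified_mlp(n_in=2, width=8, n_layers=3, n_out=1)
y = modified_mlp(params, [0.5, -0.25])
```

The two extra encoders add two input-to-width matrices and the per-layer merge, which accounts for the additional compute mentioned above.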
+
+# 7 Results 
+
+In this section, we present a series of challenging benchmarks for evaluating PINNs performance and illustrate the effectiveness of Algorithm 1, along with the proposed training strategies. Besides, we showcase the state-of-the-art results for each benchmark, demonstrating the current performance capacity of PINNs. More importantly, we believe that these results can establish robust and strong baselines, enabling future researchers to perform thorough evaluations and comparisons of their novel methods. This paves the way for continued innovation and developments in this field.
+
+For each benchmark, except for the last two, we perform comprehensive ablation studies to assess the effectiveness of the methods presented in the previous sections. In each ablation study we systematically disable each methodological component individually, while keeping the others active under the same hyper-parameter settings, and evaluate the resulting relative L2 error and run-time. This allows us to isolate the effects of each component and understand their contribution to the overall model performance.
+
+Throughout all ablation studies, we maintain the following hyper-parameter settings, unless stated otherwise. Specifically, we employ an MLP with 4 hidden layers, 256 neurons per hidden layer, and tanh activation functions as our backbone, initializing it using the Glorot scheme [55]. For model training, we use the Adam optimizer [67], starting with a learning rate of $10^{-3}$ and an exponential decay with a decay rate of 0.9 for every 2,000 decay steps. The collocation points are uniformly sampled from the computational domain with a batch size of 4096. The total number of training iterations can vary depending on the complexity of the example. Furthermore, we conduct extensive hyper-parameter sweeps across various learning rate schedules, network sizes, and activations, in order to produce state-of-the-art results for each example. Note that the hyper-parameter settings for our ablation studies differ from those yielding the best results. We summarize our results in Table 1 and provide detailed hyper-parameter settings for our optimal models in the Appendix. Throughout all numerical experiments, when applicable, we enforce the exact periodic boundary conditions as described in section 6.
+
+| Benchmark | Relative L2 error |
+| --- | --- |
+| Allen-Cahn equation | $5.37 \times 10^{-5}$ |
+| Advection equation | $6.88 \times 10^{-4}$ |
+| Stokes flow | $8.04 \times 10^{-5}$ |
+| Kuramoto–Sivashinsky equation | $1.61 \times 10^{-1}$ |
+| Lid-driven cavity flow (Re=3200) | $1.58 \times 10^{-1}$ |
+| Navier–Stokes flow in a torus | $2.45 \times 10^{-1}$ |
+| Navier–Stokes flow around a cylinder | – |
+
+Table 1: State-of-the-art relative L2 error for various benchmark equations using our proposed model.
+
+The code and data accompanying this manuscript will be made publicly available at https://github.com/PredictiveIntelligenceLab/jaxpi . It should be highlighted that our implementation automatically supports efficient data-parallel multi-GPU training. As illustrated in Figure 2, we show great weak scaling capabilities up to 256 GPUs, enabling the effective simulation of large-scale problems. Additionally, our code includes valuable utilities for monitoring gradient norms and NTK eigenvalues throughout training, which are key metrics essential for identifying potential training issues.
+
+[Figure 2 plot: "Weak Scaling Performance"; parallel efficiency (%) versus number of GPU(s); ideal and measured curves.]
+
+Figure 2: Efficiency of weak scaling using the Navier-Stokes flow (section 7.6) as a benchmark. We employ a neural network with hyper-parameters shown in Table 12 and measure the execution time for 10,000 iterations, maintaining a consistent batch size of 40960 per GPU.
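+
The ablation protocol described in this section (all components on, each component switched off one at a time, and a plain baseline with everything off) can be enumerated mechanically. The sketch below is illustrative only: the component names mirror the column headers of the ablation tables, and the function name `ablation_configs` is hypothetical.

```python
COMPONENTS = ["fourier_feature", "rwf", "grad_norm", "causal"]


def ablation_configs(components):
    """Return (label, flags) pairs: default, leave-one-out, and plain."""
    full = {name: True for name in components}
    configs = [("default", full)]
    for name in components:
        # Disable exactly one component, keep the others active.
        configs.append((f"no_{name}", {**full, name: False}))
    configs.append(("plain", {name: False for name in components}))
    return configs


configs = ablation_configs(COMPONENTS)
labels = [label for label, _ in configs]
```

Each configuration would then be trained with identical hyper-parameters and scored by relative L2 error and run time.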
+
+7.1 Allen-Cahn equation 
+
+We start with the 1D Allen-Cahn equation, a representative case with which conventional PINN models are known to struggle. It takes the form
+
+$$u_t - 0.0001\, u_{xx} + 5u^3 - 5u = 0, \quad t \in [0, 1], \; x \in [-1, 1], \tag{7.1}$$
+
+$$u(0, x) = x^2 \cos(\pi x), \tag{7.2}$$
+
+$$u(t, -1) = u(t, 1), \tag{7.3}$$
+
+$$u_x(t, -1) = u_x(t, 1). \tag{7.4}$$
+
+For this example, we first train a conventional PINN model to diagnose potential issues. In Figure 3, we visualize the histogram of back-propagated gradients, the resulting NTK eigenvalues and the temporal PDE residual loss (equation (2.10)) at the early stages of training. On the top left panel, one can see that the gradients of the PDE residual loss dominate those of the initial condition loss, which implies unbalanced back-propagated gradients. Moreover, the top right panel reveals that the network tends to minimize the PDE residuals at later times first, suggesting a violation of causality. In the bottom panel, a rapid decay in the NTK eigenvalues can be observed, indicating the presence of spectral bias. These findings strongly suggest that conventional PINNs suffer from multiple severe training pathologies, which need to be addressed simultaneously to yield satisfactory results.
+
+[Figure 3 panels: gradient density versus values for $\nabla_\theta L_{ic}$ and $\nabla_\theta L_r$; $L_r(t, \theta)$ versus t; eigenvalues of $K_{ic}$ and $K_r$ versus index.]
+
+Figure 3: Allen Cahn equation: Analysis of training a plain PINN model for 10,000 iterations. Top left: Histograms of back-propagated gradients of the PDE residual loss and initial condition loss at the last iteration. Top right: Temporal PDE residual loss at the last iteration. Bottom: NTK eigenvalues of $K_{ic}$ and $K_r$ at the last iteration.
+
+To showcase the effectiveness of the proposed training pipeline in addressing these issues, we employ Algorithm 1 and disable individual methodological components one at a time. The results are summarized in Table 2 and Figure 4. It can be concluded that the full algorithm yields the best performance, with a relative L2 error of $5.84 \times 10^{-4}$. Removing any individual component from the algorithm generally leads to worse performance, which indicates the positive contribution of each component to the overall model performance. The most significant negative impact on performance occurs when disabling the Fourier feature embedding, resulting in a relative L2 error of $4.35 \times 10^{-1}$. This implies that spectral bias degrades the predictive accuracy the most for this example.
Furthermore, it is worth noting that the run-times across different configurations are relatively similar, except for the case corresponding to conventional PINNs, which shows a slightly shorter run-time of 12.93 minutes. This strongly suggests the computational efficiency of each component presented in Algorithm 1. Finally, we present our best result in Figure 5, whereas Table 6 details the corresponding hyper-parameter configuration, and Figure 19 visualizes the loss convergence and the weight changes during training. One can see that the predicted solution achieves an excellent agreement with the reference solution, yielding a relative L2 error of $5.37 \times 10^{-5}$.
+
+7.2 Advection equation 
+
+Our second example is a 1D advection equation, a linear hyperbolic equation commonly used to model transport phenomena. It takes the form
+
+$$\frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = 0, \quad t \in [0, 1], \; x \in (0, 2\pi), \tag{7.5}$$
+
+$$u(0, x) = g(x), \quad x \in (0, 2\pi), \tag{7.6}$$
+
+with periodic boundary conditions. This example has been studied in [48, 31], exposing some of the limitations that PINNs suffer from as the transport velocity c is increased. In our experiments, we consider the challenging setting of c = 80 with an initial condition g(x) = sin(x).
+
+[Figure 4 plot: relative L2 error versus iteration for the Plain, Default, No RWF, No Grad Norm, No Fourier Feature and No Causal configurations.]
+
+Figure 4: Allen Cahn equation: Convergence of relative L2 error for the ablation study with different components disabled. Plain: Conventional PINN formulation. Default: PINN model trained using Algorithm 1. No RWF: PINN model trained using Algorithm 1 without random weight factorization. No Grad Norm: PINN model trained using Algorithm 1 without grad norm weighting scheme. No Fourier feature: PINN model trained using Algorithm 1 without random Fourier feature embeddings. No Causal: PINN model trained using Algorithm 1 without causal weighting.
+
+| Fourier Feature | RWF | Grad Norm | Causal | Rel. L2 error | Run time (min) |
+| --- | --- | --- | --- | --- | --- |
+| ✓ | ✓ | ✓ | ✓ | $5.84 \times 10^{-4}$ | 16.26 |
+| ✗ | ✓ | ✓ | ✓ | $4.35 \times 10^{-1}$ | 13.20 |
+| ✓ | ✗ | ✓ | ✓ | $6.62 \times 10^{-3}$ | 16.53 |
+| ✓ | ✓ | ✗ | ✓ | $7.51 \times 10^{-3}$ | 16.36 |
+| ✓ | ✓ | ✓ | ✗ | $1.59 \times 10^{-3}$ | 16.11 |
+| ✗ | ✗ | ✗ | ✗ | $51.74 \times 10^{-1}$ | 12.93 |
+
+Table 2: Allen Cahn equation: Relative L2 error and run time for an ablation study illustrating the impact of disabling individual components of the proposed training pipeline. Note that the GPU run time may vary due to factors such as hardware utilization, batch processing, and other computational loads.
+
+Figure 5: Allen Cahn equation: Comparison of the best prediction against the reference solution. The resulting relative L2 error is $5.37 \times 10^{-5}$. The hyper-parameter configuration can be found in Table 6.
+
+[Figure 6 panels: gradient density versus values for $\nabla_\theta L_{ic}$ and $\nabla_\theta L_r$; $L_r(t, \theta)$ versus t; eigenvalues of $K_{ic}$ and $K_r$ versus index.]
+
+Figure 6: Advection equation: Analysis of training a plain physics-informed neural network for 10,000 iterations. Top left: Histograms of back-propagated gradients of the PDE residual loss and initial condition loss at the last iteration. Top right: Temporal PDE residual loss at the last iteration. Bottom: NTK eigenvalues of $K_{ic}$ and $K_r$ at the last iteration.
+
+Analogous to the first example, we train a conventional PINN model with the aim of identifying the issues that lead to inaccurate results. As illustrated in Figure 6, it is evident that PINNs experience the same challenges as those observed in the first example. This observation strongly suggests the widespread nature of these issues in the training of PINNs, further emphasizing the necessity of addressing them to obtain robust and accurate PINN models. As mentioned in section 6.3, we can impose the spatial and temporal periodicity by
+
+$$v(t, x) = [\cos(\omega_t t), \sin(\omega_t t), \cos(\omega_x x), \sin(\omega_x x)], \tag{7.7}$$
+
+where $\omega_t = \frac{2\pi}{P_t}$ and $\omega_x = \frac{2\pi}{P_x}$ with $P_x = 2\pi$ and $P_t$ a trainable parameter. For this example, we incorporate the imposition of temporal periodicity in Algorithm 1 and subsequently perform an ablation study on the enhanced algorithm. The performance of various configurations is summarized in Table 3. One can conclude that the integration of all the techniques together yields the optimal accuracy. The exclusion of any of these elements, especially the time periodicity, Fourier features and the grad norm weighting scheme, leads to a significant increase in test error, highlighting their crucial role in achieving accurate results. Additionally, we present the state-of-the-art result in Figure 8. We see that the model prediction achieves an excellent agreement with the exact solution, with a relative L2 error of $6.88 \times 10^{-4}$. The hyper-parameter configuration and loss convergence are presented in Table 7 and Figure 20, respectively.
+
+7.3 Stokes flow 
+
+| Time Period | Fourier Feature | RWF | Grad Norm | Causal | Rel. L2 error | Run time (min) |
+| --- | --- | --- | --- | --- | --- | --- |
+| ✓ | ✓ | ✓ | ✓ | ✓ | $1.02 \times 10^{-2}$ | 9.18 |
+| ✗ | ✓ | ✓ | ✓ | ✓ | $7.37 \times 10^{-1}$ | 8.76 |
+| ✓ | ✗ | ✓ | ✓ | ✓ | $4.29 \times 10^{-1}$ | 7.60 |
+| ✓ | ✓ | ✗ | ✓ | ✓ | $1.31 \times 10^{-2}$ | 9.25 |
+| ✓ | ✓ | ✓ | ✗ | ✓ | $1.13 \times 10^{0}$ | 7.46 |
+| ✓ | ✓ | ✓ | ✓ | ✗ | $1.49 \times 10^{-2}$ | 9.18 |
+| ✗ | ✗ | ✗ | ✗ | ✗ | $9.51 \times 10^{-1}$ | 7.12 |
+
+Table 3: Advection equation: Relative L2 error and run time for an ablation study illustrating the impact of disabling individual components of the proposed technique and training pipeline.
+
+[Figure 7 plot: relative L2 error versus iteration for the Plain, Default, No Time Period, No RWF, No Grad Norm, No Fourier Feature and No Causal configurations.]
+
+Figure 7: Advection equation: Convergence of relative L2 error for the ablation study with different components disabled. Plain: Conventional PINN formulation. Default: PINN model trained imposing time periodicity and using Algorithm 1. No Time Period: PINN model trained using Algorithm 1. No RWF: PINN model trained imposing time periodicity and using Algorithm 1 without random weight factorization. No Grad Norm: PINN model trained imposing time periodicity and using Algorithm 1 without grad norm weighting scheme. No Fourier feature: PINN model trained imposing time periodicity and using Algorithm 1 without random Fourier feature embeddings. No Causal: PINN model trained imposing time periodicity and using Algorithm 1 without causal weighting.
+
+Figure 8: Advection equation: Comparison of the best prediction against the reference solution obtained from the hyper-parameter sweep. The resulting relative L2 error is $6.88 \times 10^{-4}$. The hyper-parameter configuration can be found in Table 7.
+
+In this example, we explore a specific example of Stokes flow with the aim of emphasizing the importance of non-dimensionalization in PINNs training. Stokes flow is a fluid flow regime where viscous forces outweigh inertial forces, occurring in scenarios such as small particle motion in liquids, fluid flow through porous media, and microorganism locomotion in fluid environments. The governing equation is given by
+
+$$-\nu \Delta \mathbf{u} + \nabla p = 0, \tag{7.8}$$
+
+$$\nabla \cdot \mathbf{u} = 0, \tag{7.9}$$
+
+where $\mathbf{u} = (u, v)$ defines the velocity and p the pressure, and ν is the kinematic viscosity. As depicted in Figure 9, the underlying geometry is a pipe $\Omega = [0, 2.2] \times [0, 0.41] \setminus B_r(0.2, 0.2)$ with a circular cylinder obstacle of radius r = 0.05. For the top and bottom walls $\Gamma_1 = [0, 2.2] \times 0.41$ and $\Gamma_2 = [0, 2.2] \times 0$ as well as the boundary $S = \partial B_r(0.2, 0.2)$, we impose the no-slip boundary condition
+
+$$\mathbf{u}|_{\Gamma_1} = \mathbf{u}|_{\Gamma_2} = \mathbf{u}|_{S} = 0. \tag{7.10}$$
+
+At the inlet $\Gamma_3 = 0 \times [0, 0.41]$, a parabolic inflow profile is prescribed,
+
+$$\mathbf{u}(0, y) = \mathbf{u}_{\mathrm{in}} = \left( \frac{4 U y (0.41 - y)}{0.41^2}, \; 0 \right), \tag{7.11}$$
+
+with a maximum velocity U = 0.3. At the outlet $\Gamma_4 = 2.2 \times [0, 0.41]$, we define the outflow condition
+
+$$\nu \partial_{\mathbf{n}} \mathbf{u} - p \mathbf{n} = 0, \tag{7.12}$$
+
+where $\mathbf{n}$ denotes the outer normal vector. To non-dimensionalize the system, we select the characteristic flow velocity and length as $U^* = 0.2$ and $L^* = 0.1$, respectively. This results in a Reynolds number of
+
+$$Re = \frac{U^* L^*}{\nu} = \frac{0.2 \cdot 0.1}{0.001} = 20. \tag{7.13}$$
+
+We can then define the non-dimensionalized variables as
+
+$$\mathbf{x}^* = \frac{\mathbf{x}}{L^*}, \quad \mathbf{u}^* = \frac{\mathbf{u}}{U^*}, \quad p^* = \frac{p L^*}{\nu U^*}, \quad \nabla^* = L^* \nabla. \tag{7.14}$$
+
+By substituting these scales into the dimensionalized system, we obtain the non-dimensionalized PDE as
+
+$$-\frac{1}{Re} \Delta \mathbf{u}^* + \nabla^* p^* = 0 \quad \text{in } \Omega^*, \tag{7.15}$$
+
+$$\nabla^* \cdot \mathbf{u}^* = 0 \quad \text{in } \Omega^*, \tag{7.16}$$
+
+$$\mathbf{u}^* = 0 \quad \text{on } \Gamma_1^* \cup \Gamma_2^* \cup S^*, \tag{7.17}$$
+
+$$\mathbf{u}^* = \frac{\mathbf{u}_{\mathrm{in}}}{U^*} \quad \text{on } \Gamma_3^*, \tag{7.18}$$
+
+$$\frac{1}{Re} \frac{\partial \mathbf{u}^*}{\partial \mathbf{n}} - p^* \mathbf{n} = 0 \quad \text{on } \Gamma_4^*, \tag{7.19}$$
+
+where $\Omega^*$, $S^*$ and $\{\Gamma_i^*\}_{i=1}^{4}$ denote the non-dimensionalized domains, respectively.
+
+To perform an ablation study for Algorithm 1, we employ an MLP with 4 hidden layers, 128 neurons per hidden layer, and GeLU activation functions and train each model for $10^5$ iterations of gradient descent using the Adam optimizer. The results are summarized in Table 4, and strongly indicate the positive impact of all proposed components on model performance; disabling any one component leads to worse predictive accuracy. In particular, comparing the performance of the configuration with non-dimensionalization enabled (1st row) to the one with non-dimensionalization disabled (5th row), we observe a substantial increase in the relative L2 error when non-dimensionalization is removed. This observation highlights the importance of non-dimensionalization in achieving optimal performance for solving the Stokes equation. Moreover, as evidenced by the 3rd and 4th rows of the table, models trained without Fourier features and RWF fail to capture the correct solution, thus implying their essential contribution to the overall model performance. Lastly, we present the results of a fine-tuned PINN model in Figure 11, which exhibits excellent agreement with the reference solution and achieves a relative L2 error of $8.04 \times 10^{-5}$. The detailed hyper-parameter configuration and the loss convergence are respectively shown in Table 8 and Figure 21.
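+
The scalings in (7.13) and (7.14) are easy to check numerically. The sketch below uses the values quoted in this section; the constant and function names (`NU`, `U_STAR`, `L_STAR`, `inflow`, `to_nondim`) are hypothetical and chosen only for this illustration.

```python
NU = 0.001     # kinematic viscosity
U_STAR = 0.2   # characteristic velocity U*
L_STAR = 0.1   # characteristic length L*
U_MAX = 0.3    # peak inflow velocity U in (7.11)
H = 0.41       # channel height

reynolds = U_STAR * L_STAR / NU  # (7.13)


def inflow(y):
    """Dimensional parabolic inflow profile from (7.11)."""
    return 4.0 * U_MAX * y * (H - y) / H ** 2


def to_nondim(x, y, u, p):
    """Apply the scalings of (7.14) to a point, a velocity and a pressure."""
    return x / L_STAR, y / L_STAR, u / U_STAR, p * L_STAR / (NU * U_STAR)


# Centre of the inlet: the profile peaks at U_MAX there.
x_s, y_s, u_s, p_s = to_nondim(0.0, H / 2, inflow(H / 2), 0.0)
```

After scaling, the peak inlet velocity is of order one, so the network inputs and outputs sit in a comparable numeric range.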
+
+7.4 Kuramoto–Sivashinsky equation 
+
+[Figure 9 diagram: pipe geometry with the boundaries Γ1, Γ2, Γ3 and Γ4 labelled.]
+
+Figure 9: Stokes equation: Illustration of the pipe geometry for Stokes flow.
+
+| Fourier Feature | RWF | Grad Norm | Non-dimensionalization | Rel. L2 error | Run time (min) |
+| --- | --- | --- | --- | --- | --- |
+| ✓ | ✓ | ✓ | ✓ | $5.41 \times 10^{-4}$ | 9.51 |
+| ✗ | ✓ | ✓ | ✓ | $9.56 \times 10^{-1}$ | 7.93 |
+| ✓ | ✗ | ✓ | ✓ | $9.86 \times 10^{-1}$ | 9.58 |
+| ✓ | ✓ | ✗ | ✓ | $1.01 \times 10^{-2}$ | 8.63 |
+| ✓ | ✓ | ✓ | ✗ | $9.74 \times 10^{-1}$ | 9.58 |
+| ✗ | ✗ | ✗ | ✗ | $9.21 \times 10^{-1}$ | 7.95 |
+
+Table 4: Stokes equation: Relative L2 error and run time for an ablation study illustrating the impact of disabling non-dimensionalization and individual components of the proposed training pipeline. The error is measured against the norm of flow velocity $\|\mathbf{u}\|_2 = \sqrt{u^2 + v^2}$.
+
+[Figure 10 plot: relative L2 error versus iteration for the Plain, Default, No RWF, No Grad Norm, No Fourier Feature and No Nondimensionalization configurations.]
+
+Figure 10: Stokes equation: Convergence of relative L2 error for the ablation study with different components disabled.
+
+Figure 11: Stokes equation: Comparison of the best prediction against the reference solution obtained from the hyper-parameter sweep. The resulting relative L2 error is $8.04 \times 10^{-5}$. The hyper-parameter configuration can be found in Table 8.
+
+In this example, we aim to demonstrate the potential of PINNs in simulating chaotic dynamics and highlight the necessity of adopting a time-marching strategy in scenarios where high predictive accuracy is needed. To this end, we consider the Kuramoto–Sivashinsky equation, which exhibits a wealth of spatially and temporally nontrivial dynamical behavior, and has served as a model example in efforts to understand and predict the complex dynamical behavior associated with a variety of physical systems. The equation takes the form
+
+$$u_t + \alpha u u_x + \beta u_{xx} + \gamma u_{xxxx} = 0, \quad t \in [0, 1], \; x \in [0, 2\pi], \tag{7.20}$$
+
+subject to periodic boundary conditions and an initial condition
+
+$$u(0, x) = u_0(x). \tag{7.21}$$
+
+Specifically, we take $\alpha = 100/16$, $\beta = 100/16^2$, $\gamma = 100/16^4$ and $u_0(x) = \cos(x)(1 + \sin(x))$.
+
+Based on our experience, it appears highly challenging to conduct long-term integration of this PDE system via a single-shot training of PINNs. This could potentially be attributed to the inherently chaotic nature of the system and the insufficient accuracy of PINNs predictions. To illustrate this point, we train a PINN to simulate the dynamical system up to different final times T without time-marching, while keeping the same hyper-parameter settings. As shown in the left panel of Figure 12, we can see that the resulting relative L2 error drastically increases for larger temporal domains and eventually leads to a failure in correctly capturing the PDE solution. This illustrates the necessity for applying time-marching in order to mitigate the difficulties of approximation and optimization, thus leading to more accurate results. However, we must emphasize that the computational cost of time-marching is considerably larger than one-shot learning as one needs to train multiple PINN models sequentially. It would be interesting to explore the acceleration of this training process in future work.
+
+Moreover, we present an ablation study on Algorithm 1 and summarize our results in Table 5. It can be concluded that all proposed components positively contribute to the overall model performance, and removing any one of them results in increased errors. Notably, the use of the modified MLP greatly enhances the predictive accuracy, reflected in the substantial error reduction from $2.98 \times 10^{-3}$ to $1.42 \times 10^{-4}$. From our experience, modified MLPs typically outperform plain MLPs, especially for tackling non-linear PDE systems. Furthermore, the predicted solution obtained from our best model is visualized in Figure 13, which is in good agreement with the ground truth.
Nevertheless, some discrepancies can be observed near t = 1, which may be due to the error accumulation and the inherent nature of chaos. More details of implementation and training are provided in Table 9 and Figure 22.
+
+Figure 12: Kuramoto–Sivashinsky equation: Left: Relative L2 errors from one-shot PINN training for different system final time T under the same hyper-parameter setting. Right: Convergence of relative L2 error for the ablation study with different components disabled. 
+
+| Modified MLP | Fourier Feature | RWF | Grad Norm | Causal | Rel. L2 error | Run time (min) |
+| --- | --- | --- | --- | --- | --- | --- |
+| ✓ | ✓ | ✓ | ✓ | ✓ | $1.42 \times 10^{-4}$ | 13.33 |
+| ✗ | ✓ | ✓ | ✓ | ✓ | $2.98 \times 10^{-3}$ | 6.21 |
+| ✓ | ✗ | ✓ | ✓ | ✓ | $1.86 \times 10^{-2}$ | 7.60 |
+| ✓ | ✓ | ✗ | ✓ | ✓ | $1.86 \times 10^{-4}$ | 14.11 |
+| ✓ | ✓ | ✓ | ✗ | ✓ | $2.19 \times 10^{-1}$ | 14.11 |
+| ✓ | ✓ | ✓ | ✓ | ✗ | $2.58 \times 10^{-4}$ | 9.18 |
+| ✗ | ✗ | ✗ | ✗ | ✗ | $2.59 \times 10^{-1}$ | 7.12 |
+
+Table 5: Kuramoto–Sivashinsky equation: Relative L2 error and run time for an ablation study illustrating the impact of disabling individual components of the proposed technique and training pipeline.
+
+Figure 13: Kuramoto–Sivashinsky equation: Comparison of the best prediction against the reference solution. The relative L2 error of the spatio-temporal predicted solution is $1.61 \times 10^{-1}$. Note that the majority of this error is attributed to the last few time steps.
+
+[Figure 14 plot: relative L2 error versus Reynolds number.]
+
+Figure 14: Lid-driven cavity: Relative L2 error of training PINNs with Algorithm 1 at different Reynolds numbers Re ∈ [100, 400, 1000, 3200].
+
+Figure 15: Lid-driven cavity (Re=3200): Left: Predicted velocity of the fine-tuned model. Right: Comparison of the predicted velocity profiles on the vertical and horizontal center-lines against Ghia et al. [68]. The resulting relative L2 error against the reference solution is $1.58 \times 10^{-1}$.
+
+7.5 Lid-driven cavity flow 
+
+In this example, we consider a classical benchmark problem in computational fluid dynamics, describing the motion of an incompressible fluid in a two-dimensional square cavity. The system is governed by the incompressible Navier–Stokes equations written in a non-dimensional form
+
+$$\mathbf{u} \cdot \nabla \mathbf{u} + \nabla p - \frac{1}{Re} \Delta \mathbf{u} = 0, \quad (x, y) \in (0, 1)^2, \tag{7.22}$$
+
+$$\nabla \cdot \mathbf{u} = 0, \quad (x, y) \in (0, 1)^2, \tag{7.23}$$
+
+where $\mathbf{u} = (u, v)$ denotes the velocity in x and y directions, respectively, and p is the scalar pressure field. We assume $\mathbf{u} = (1, 0)$ on the top lid of the cavity, and a no-slip boundary condition on the other three walls. We are interested in the velocity and pressure distribution for a Reynolds number of 3200.
+
+In our experience, when trained directly at a high Reynolds number, PINNs tend to be unstable and susceptible to converging to erroneous solutions. This observation is verified by the left panel of Figure 14, where we plot the relative L2 errors from training PINNs with Algorithm 1 at varying Reynolds numbers under the same hyper-parameter settings. Our results demonstrate that PINNs struggle to yield accurate solutions for Reynolds numbers greater than 1,000.
+
+To improve this result, one effective approach is to start the training of PINNs with a lower initial Reynolds number, and gradually increase the Reynolds number during training. In this way, the model parameters obtained from the training with smaller Reynolds numbers serve as a good initialization when training for higher Reynolds numbers. To demonstrate this, we select an increasing sequence of Reynolds numbers [100, 400, 1000, 3200] and train PINNs with Algorithm 1 for $5 \times 10^4$, $5 \times 10^4$, $1 \times 10^5$, $5 \times 10^5$ iterations, respectively. The detailed hyper-parameter configuration is summarized in Table 10. As shown in Figure 15, our predicted velocity field agrees well with the reference results of Ghia et al. [68], yielding a relative L2 error of $1.58 \times 10^{-1}$ against the reference solution.
+
+Figure 16: Navier-Stokes flow in a torus: Comparison of the best prediction against the reference solution at the last time step. The animation is provided in https://github.com/PredictiveIntelligenceLab/jaxpi .
+
+7.6 Navier–Stokes flow in a torus 
+
+As the second to last example, our goal is to showcase the capability of PINNs in simulating incompressible Navier–Stokes flow using the velocity-vorticity formulation. The equation is given by
+
+$$w_t + \mathbf{u} \cdot \nabla w = \frac{1}{Re} \Delta w, \quad \text{in } [0, T] \times \Omega, \tag{7.24}$$
+
+$$\nabla \cdot \mathbf{u} = 0, \quad \text{in } [0, T] \times \Omega, \tag{7.25}$$
+
+$$w(0, x, y) = w_0(x, y), \quad \text{in } \Omega. \tag{7.26}$$
+
+Here, $\mathbf{u} = (u, v)$ represents the flow velocity field, $w = \nabla \times \mathbf{u}$ denotes the vorticity, and $Re$ denotes the Reynolds number. For this example, we define $\Omega = [0, 2\pi]^2$ and set $Re$ as 100. As the validation and effectiveness of the proposed PINN algorithm have been rigorously proven in prior examples, our focus is on simulating the vorticity evolution up to $T = 10$ using PINNs. To this end, we split the temporal domain into 5 intervals and employ a time-marching strategy. For each interval, we use a PINN model with a modified MLP (4 hidden layers, 256 neurons per hidden layer, Tanh activations) and train it using Algorithm 1 for $10^5$ iterations of gradient descent with the Adam optimizer. The results of this simulation are summarized in Figure 16, which provides a visual comparison of the reference and predicted vorticity at $T = 10$. While a slight misalignment between the two can be observed, the model prediction is in good agreement with the corresponding numerical estimations. This demonstrates the capability of PINNs to closely match the reference solution, emphasizing its effectiveness in simulating vortical fluid flows.
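
The time-marching strategy used here (one model per time window, each started from the previous window's final state) can be sketched as follows. This is a hedged outline rather than the authors' code: `time_windows`, `march`, and the `train_window(t0, t1, ic)` callable are hypothetical names, and `toy_train_window` is a stand-in that returns an exact solution of du/dt = 1 instead of training a network.

```python
def time_windows(t_end, n_windows, t_start=0.0):
    """Split [t_start, t_end] into equal, consecutive time windows."""
    width = (t_end - t_start) / n_windows
    return [(t_start + k * width, t_start + (k + 1) * width)
            for k in range(n_windows)]


def march(windows, train_window, initial_condition):
    """Train one model per window, chaining each end state into the next window.

    `train_window(t0, t1, ic)` is a hypothetical user-supplied callable that
    fits a model on [t0, t1] starting from `ic` and returns a function of time.
    """
    models, ic = [], initial_condition
    for t0, t1 in windows:
        model = train_window(t0, t1, ic)
        models.append(model)
        ic = model(t1)  # prediction at the window end seeds the next window
    return models


def toy_train_window(t0, t1, ic):
    """Stand-in for PINN training: exact solution of du/dt = 1 with u(t0) = ic."""
    return lambda t: ic + (t - t0)


windows = time_windows(10.0, 5)              # the 5 intervals on [0, 10] from the text
models = march(windows, toy_train_window, 0.0)
```

With a real PINN, the initial condition passed between windows would be the predicted field evaluated on a set of spatial points rather than a single number.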
+
+7.7 Navier–Stokes flow around a cylinder 
+
+In our last example, we investigate a classical benchmark in computational fluid dynamics, describing the behaviour of a transient fluid in a pipe with a circular obstacle. Previous research by Chuang et al. [69] reported that PINNs act as a steady-flow solver, and fail to capture the phenomenon of vortex shedding. Here we challenge these findings and demonstrate that, if properly used, PINNs can successfully simulate the development of vortex shedding in this scenario. Specifically, we consider a fluid with a density of $\rho = 1.0$ and describe its behavior using the time-dependent incompressible Navier-Stokes equations
+
+$$\mathbf{u}_t + \mathbf{u} \cdot \nabla \mathbf{u} + \nabla p - \nu \Delta \mathbf{u} = 0, \tag{7.27}$$
+
+$$\nabla \cdot \mathbf{u} = 0, \tag{7.28}$$
+
+with $\mathbf{u} = (u, v)$ defining the velocity field and $p$ the pressure. The kinematic viscosity is taken as $\nu = 0.001$. The underlying geometry is identical to Figure 9 and the boundary conditions are the same as the Stokes flow example discussed in section 7.3. However, we introduce a parabolic inflow profile with a maximum velocity of $U = 1.5$. As a result, we have characteristic flow velocity and length values of $U = 1.0$ and $L = 0.1$, respectively, and a Reynolds number of $Re = 100$. We begin by normalizing the PDE system as follows:
+
+$$x^* = \frac{x}{L^*}, \quad t^* = \frac{L^*}{U^*}, \quad \mathbf{u}^* = \frac{\mathbf{u}}{U^*}, \quad p^* = \frac{p L^*}{\nu U^*}, \quad \nabla^* = L^* \nabla. \tag{7.29}$$
+
+This leads us to the non-dimensionalized equations:
+
+$$\mathbf{u}^*_t + \mathbf{u}^* \cdot \nabla^* \mathbf{u}^* + \nabla^* p^* - \frac{1}{Re} \Delta^* \mathbf{u}^* = 0, \tag{7.30}$$
+
+$$\nabla^* \cdot \mathbf{u}^* = 0. \tag{7.31}$$
+
+Figure 17: Navier-Stokes flow around cylinder: Predicted velocity field and pressure at the last time step. The animation is provided in https://github.com/PredictiveIntelligenceLab/jaxpi.
+
+To obtain a proper initial condition for PINNs, we start with a zero solution and run a numerical simulation for 4 seconds at a very coarse spatial and temporal resolution. We then use the last time-step as our initial condition for the PINNs simulation. Using a time-marching strategy, we partition the temporal domain $[0, 10]$ into 10 individual time windows. For each window, a modified MLP is employed as our model backbone. PINN training runs for $2 \times 10^5$ iterations per window following Algorithm 1. Key hyper-parameters are detailed in Table 13. It is worth mentioning that there are more than 10 terms in the total loss, so manually adjusting the weight of each loss term is practically infeasible. The predicted velocity and pressure field at $T = 10$ are plotted in Figure 17. For this benchmark, we do not report the test error against the numerical solution, as the start time of vortex shedding in numerical solvers fluctuates based on the underlying discretizations. To the best of our knowledge, our work presents the first empirical evidence of a PINN model being able to capture the phenomenon of vortex shedding. This finding opens up new avenues for further research and application of PINNs in the field of computational fluid dynamics.
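
As a quick sanity check on the numbers quoted for the cylinder benchmark, the Reynolds number can be recomputed from the characteristic scales. The sketch below assumes that the characteristic velocity is the mean of the parabolic inflow profile (for a plane parabolic profile the mean is two thirds of the maximum); the text does not state this explicitly, and all function names are illustrative.

```python
def mean_of_parabolic_profile(u_max):
    """Mean velocity of a plane parabolic profile: two thirds of its maximum."""
    return 2.0 * u_max / 3.0


def reynolds_number(u_char, l_char, nu):
    """Re = U * L / nu for characteristic velocity U, length L, viscosity nu."""
    return u_char * l_char / nu


def to_nondimensional(x, u, l_char, u_char):
    """Rescale a position and a velocity by the characteristic length and velocity."""
    return x / l_char, u / u_char


u_char = mean_of_parabolic_profile(1.5)               # maximum inflow velocity from the text
re = reynolds_number(u_char, l_char=0.1, nu=0.001)    # L and nu from the text
x_star, u_star = to_nondimensional(0.2, 1.5, l_char=0.1, u_char=u_char)
```

Under this assumption the script reproduces the quoted values of a characteristic velocity of 1.0 and a Reynolds number of 100.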
+
+# 8 Conclusions 
+
+In this work, we introduce a comprehensive training pipeline for physics-informed neural networks, addressing various training pathologies such as spectral bias, imbalanced losses, and causality violation. Our pipeline seamlessly integrates essential techniques, including equation non-dimensionalization, Fourier feature embeddings, loss weighting schemes and causal training strategies. Moreover, we explore additional techniques such as a modified MLP architecture, random weight factorization and curriculum training, which can further improve the training stability and model performance. By sharing our empirical findings, we also provide insights into selecting appropriate hyper-parameters associated with network architectures and learning rate schedules in conjunction with the aforementioned algorithms. To demonstrate the effectiveness of the proposed training pipeline, we perform thorough ablation studies on a collection of benchmarks which PINNs often struggle with, and showcase state-of-the-art results, which we believe should serve as a strong baseline for future studies. By establishing these benchmarks, we hope that our contribution will serve as a cornerstone for fairer and more systematic comparisons in the development and adoption of PINN-based methodologies, ultimately propelling PINN research towards more effective and reliable solutions in computational science and engineering.
+
+# Acknowledgments 
+
+We would like to acknowledge support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116), the US Air Force (grant AFOSR FA9550-20-1-0060), and US Department of Energy/Advanced Research Projects Agency (grant DE-AR0001201). We also thank the developers of the software that enabled our research, including JAX [63], JAX-CFD [70], Matplotlib [71], and NumPy [72].
+
+# References 
+
+[1] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, Eftychios Protopapadakis, et al. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018, 2018.
+
+[2] KR1442 Chowdhary and KR Chowdhary. Natural language processing. Fundamentals of artificial intelligence, pages 603–649, 2020.
+
+[3] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
+
+[4] Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023.
+
+[5] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv preprint arXiv:2202.11214, 2022.
+
+[6] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Alexander Pritzel, Suman Ravuri, Timo Ewalds, Ferran Alet, Zach Eaton-Rosen, et al. Graphcast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794, 2022.
+
+[7] Han Wang, Linfeng Zhang, Jiequn Han, and E Weinan. Deepmd-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Computer Physics Communications, 228:178–184, 2018.
+
+[8] D. Pfau, J.S. Spencer, A.G. de G. Matthews, and W.M.C. Foulkes. Ab-initio solution of the many-electron schrödinger equation with deep neural networks. Phys. Rev. Research, 2:033429, 2020.
+
+[9] Kiersten M Ruff and Rohit V Pappu. Alphafold and implications for intrinsically disordered proteins. Journal of Molecular Biology, 433(20):167208, 2021.
+
+[10] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
+
+[11] Maziar Raissi, Alireza Yazdani, and George Em Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.
+
+[12] Luning Sun, Han Gao, Shaowu Pan, and Jian-Xun Wang. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Computer Methods in Applied Mechanics and Engineering, 361:112732, 2020.
+
+[13] Abhilash Mathews, Manaure Francisquez, Jerry W Hughes, David R Hatch, Ben Zhu, and Barrett N Rogers. Uncovering turbulent plasma dynamics via deep learning from partial observations. Physical Review E, 104(2):025205, 2021.
+
+[14] Francisco Sahli Costabal, Yibo Yang, Paris Perdikaris, Daniel E Hurtado, and Ellen Kuhl. Physics-informed neural networks for cardiac activation mapping. Frontiers in Physics, 8:42, 2020.
+
+[15] Georgios Kissas, Yibo Yang, Eileen Hwuang, Walter R Witschey, John A Detre, and Paris Perdikaris. Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 358:112623, 2020.
+
+[16] Zhiwei Fang and Justin Zhan. Deep physical informed neural networks for metamaterial design. IEEE Access, 8:24506–24513, 2019.
+
+[17] Yuyao Chen, Lu Lu, George Em Karniadakis, and Luca Dal Negro. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Optics express, 28(8):11618–11633, 2020.
+
+[18] Enrui Zhang, Ming Dao, George Em Karniadakis, and Subra Suresh. Analyses of internal structures and defects in materials using physics-informed neural networks. Science advances, 8(7):eabk0644, 2022.
+
+[19] Mahmudul Islam, Md Shajedul Hoque Thakur, Satyajit Mojumder, and Mohammad Nasim Hasan. Extraction of material properties through multi-fidelity deep learning from molecular dynamics simulation. Computational Materials Science, 188:110187, 2021.
+
+[20] Alexander Kovacs, Lukas Exl, Alexander Kornell, Johann Fischbacher, Markus Hovorka, Markus Gusenbauer, Leoni Breth, Harald Oezelt, Masao Yano, Noritsugu Sakuma, et al. Conditional physics informed neural networks. Communications in Nonlinear Science and Numerical Simulation, 104:106041, 2022.
+
+[21] Zhiwei Fang. A high-efficient hybrid physics-informed neural networks based on convolutional neural network. IEEE Transactions on Neural Networks and Learning Systems, 33(10):5514–5526, 2021.
+
+[22] Ehsan Haghighat and Ruben Juanes. Sciann: A keras/tensorflow wrapper for scientific computations and physics-informed deep learning using artificial neural networks. Computer Methods in Applied Mechanics and Engineering, 373:113552, 2021.
+
+[23] Jonathan D Smith, Zachary E Ross, Kamyar Azizzadenesheli, and Jack B Muir. Hyposvi: Hypocentre inversion with stein variational inference and physics informed neural networks. Geophysical Journal International, 228(1):698–710, 2022.
+
+[24] Oliver Hennigh, Susheela Narasimhan, Mohammad Amin Nabian, Akshay Subramaniam, Kaustubh Tangsali, Max Rietmann, Jose del Aguila Ferrandis, Wonmin Byeon, Zhiwei Fang, and Sanjay Choudhry. Nvidia simnet^{TM}: an ai-accelerated multi-physics simulation framework. arXiv preprint arXiv:2012.07938, 2020.
+
+[25] Shengze Cai, Zhicheng Wang, Sifan Wang, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks for heat transfer problems. Journal of Heat Transfer, 143(6), 2021.
+
+[26] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5):A3055–A3081, 2021.
+
+[27] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.
+
+[28] Levi McClenny and Ulisses Braga-Neto. Self-adaptive physics-informed neural networks using a soft attention mechanism. arXiv preprint arXiv:2009.04544, 2020.
+
+[29] Suryanarayana Maddu, Dominik Sturm, Christian L Müller, and Ivo F Sbalzarini. Inverse dirichlet weighting enables reliable training of physics informed neural networks. Machine Learning: Science and Technology, 3(1):015026, 2022.
+
+[30] Mohammad Amin Nabian, Rini Jasmine Gladstone, and Hadi Meidani. Efficient training of physics-informed neural networks via importance sampling. Computer-Aided Civil and Infrastructure Engineering, 2021.
+
+[31] Arka Daw, Jie Bu, Sifan Wang, Paris Perdikaris, and Anuj Karpatne. Rethinking the importance of sampling in physics-informed neural networks. arXiv preprint arXiv:2207.02338, 2022.
+
+[32] Chenxi Wu, Min Zhu, Qinyang Tan, Yadhu Kartha, and Lu Lu. A comprehensive study of non-adaptive and residual-based adaptive sampling for physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 403:115671, 2023.
+
+[33] Ameya D Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020.
+
+[34] Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains. arXiv preprint arXiv:2007.11207, 2020.
+
+[35] Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 384:113938, 2021.
+
+[36] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020.
+
+[37] Han Gao, Luning Sun, and Jian-Xun Wang. Phygeonet: Physics-informed geometry-adaptive convolutional neural networks for solving parameterized steady-state pdes on irregular domain. Journal of Computational Physics, 428:110079, 2021.
+
+[38] Rizal Fathony, Anit Kumar Sahu, Devin Willmott, and J Zico Kolter. Multiplicative filter networks. In International Conference on Learning Representations, 2021.
+
+[39] Ben Moseley, Andrew Markham, and Tarje Nissen-Meyer. Finite basis physics-informed neural networks (fbpinns): a scalable domain decomposition approach for solving differential equations. arXiv preprint arXiv:2107.07871, 2021.
+
+[40] Namgyu Kang, Byeonghyeon Lee, Youngjoon Hong, Seok-Bae Yun, and Eunbyung Park. Pixel: Physics-informed cell representations for fast and accurate pde solvers. arXiv preprint arXiv:2207.12800, 2022.
+
+[41] Chuwei Wang, Shanda Li, Di He, and Liwei Wang. Is l2 physics informed loss always suitable for training physics informed neural network? Advances in Neural Information Processing Systems, 35:8278–8290, 2022.
+
+[42] Pao-Hsiung Chiu, Jian Cheng Wong, Chinchun Ooi, My Ha Dao, and Yew-Soon Ong. Can-pinn: A fast physics-informed neural network based on coupled-automatic–numerical differentiation method. Computer Methods in Applied Mechanics and Engineering, 395:114909, 2022.
+
+[43] Ehsan Kharazmi, Zhongqiang Zhang, and George Em Karniadakis. hp-vpinns: Variational physics-informed neural networks with domain decomposition. Computer Methods in Applied Mechanics and Engineering, 374:113547, 2021.
+
+[44] Ravi G Patel, Indu Manickam, Nathaniel A Trask, Mitchell A Wood, Myoungkyu Lee, Ignacio Tomas, and Eric C Cyr. Thermodynamically consistent physics-informed neural networks for hyperbolic systems. Journal of Computational Physics, 449:110754, 2022.
+
+[45] Jeremy Yu, Lu Lu, Xuhui Meng, and George Em Karniadakis. Gradient-enhanced physics-informed neural networks for forward and inverse pde problems. Computer Methods in Applied Mechanics and Engineering, 393:114823, 2022.
+
+[46] Hwijae Son, Jin Woo Jang, Woo Jin Han, and Hyung Ju Hwang. Sobolev training for physics informed neural networks. arXiv preprint arXiv:2101.08932, 2021.
+
+[47] Colby L Wight and Jia Zhao. Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics informed neural networks. arXiv preprint arXiv:2007.04542, 2020.
+
+[48] Aditi S Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. arXiv preprint arXiv:2109.01050, 2021.
+
+[49] Shaan Desai, Marios Mattheakis, Hayden Joy, Pavlos Protopapas, and Stephen Roberts. One-shot transfer learning of physics-informed neural networks. arXiv preprint arXiv:2110.11286, 2021.
+
+[50] Somdatta Goswami, Cosmin Anitescu, Souvik Chakraborty, and Timon Rabczuk. Transfer learning enhanced physics informed neural network for phase-field modeling of fracture. Theoretical and Applied Fracture Mechanics, 106:102447, 2020.
+
+[51] Souvik Chakraborty. Transfer learning based multi-fidelity physics informed deep neural network. Journal of Computational Physics, 426:109942, 2021.
+
+[52] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310, 2019.
+
+[53] Sifan Wang, Shyam Sankaran, and Paris Perdikaris. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404, 2022.
+
+[54] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
+
+[55] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
+
+[56] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
+
+[57] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
+
+[58] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019.
+
+[59] Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. arXiv preprint arXiv:2003.04560, 2020.
+
+[60] Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
+
+[61] Sifan Wang, Hanwen Wang, Jacob H Seidman, and Paris Perdikaris. Random weight factorization improves the training of continuous neural representations. arXiv preprint arXiv:2210.01274, 2022.
+
+[62] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016.
+
+[63] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
+
+[64] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. Journal of Computational Physics, 435:110242, 2021.
+
+[65] N Sukumar and Ankit Srivastava. Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks. arXiv preprint arXiv:2104.08426, 2021.
+
+[66] Lu Lu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, and Steven G Johnson. Physics-informed neural networks with hard constraints for inverse design. arXiv preprint arXiv:2102.04626, 2021.
+
+[67] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
+
+[68] U Ghia, K.N Ghia, and C.T Shin. High-re solutions for incompressible flow using the navier-stokes equations and a multigrid method. Journal of Computational Physics, 48(3):387–411, 1982.
+
+[69] Pi-Yueh Chuang and Lorena A Barba. Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration. arXiv preprint arXiv:2205.14249, 2022.
+
+[70] Dmitrii Kochkov, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21), 2021.
+
+[71] John D Hunter. Matplotlib: A 2D graphics environment. IEEE Annals of the History of Computing, 9(03):90–95, 2007.
+
+[72] Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357–362, 2020.
+
+[73] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
+
+[74] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems, pages 4761–4771, 2019.
+
+[75] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
+
+[76] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019.
+
+[77] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019.
+
+[78] Arieh Iserles. A first course in the numerical analysis of differential equations. Number 44. Cambridge university press, 2009.
+
+# A Spectral Bias through the lens of the Neural Tangent Kernel 
+
+We investigate spectral bias [52, 73, 74] in the training behavior of deep fully-connected networks through the lens of Neural Tangent Kernel (NTK) theory. Let $f_\theta(x)$ be a scalar-valued fully-connected neural network. Suppose we are given a training data-set $\{X_{\text{train}}, Y_{\text{train}}\}$, where $X_{\text{train}} = (x_i)_{i=1}^N$ are inputs and $Y_{\text{train}} = (y_i)_{i=1}^N$ are the corresponding labels. We consider a network trained by minimizing the mean square loss
+
+$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} |f_\theta(x_i) - y_i|^2. \tag{A.1}$$
+
+Following the derivation of Jacot et al. [75, 76], we can define the resulting neural tangent kernel $K$, whose $ij$-th entry is given by
+
+$$K_{ij} = K(x_i, x_j) = \left\langle \frac{\partial f_\theta(x_i)}{\partial \theta}, \frac{\partial f_\theta(x_j)}{\partial \theta} \right\rangle. \tag{A.2}$$
+
+The NTK theory shows that, under gradient descent dynamics with an infinitesimally small learning rate (gradient flow), the kernel $K$ converges to a deterministic kernel $K^*$ and does not change during training as the width of the network grows to infinity. Furthermore, under the asymptotic conditions stated in Lee et al. [77], we may derive that
+
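+$$\frac{d f_{\theta(\tau)}(X_{\text{train}})}{d\tau} \approx -K \cdot \left( f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}} \right), \tag{A.3}$$
+
+where $\theta(\tau)$ denotes the parameters of the network at iteration $\tau$ and $f_{\theta(\tau)}(X_{\text{train}}) = (f_{\theta(\tau)}(x_i))_{i=1}^N$. Then, it directly follows that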
+
+$$f_{\theta(\tau)}(X_{\text{train}}) \approx \left( I - e^{-K\tau} \right) \cdot Y_{\text{train}}. \tag{A.4}$$
+
+Since the kernel $K$ is positive semi-definite, we can take its spectral decomposition $K = Q \Lambda Q^T$, where $Q$ is an orthogonal matrix whose $i$-th column is the eigenvector $q_i$ of $K$ and $\Lambda$ is a diagonal matrix whose diagonal entries $\lambda_i$ are the corresponding eigenvalues. Since $e^{-K\tau} = Q e^{-\Lambda\tau} Q^T$, we have
+
+$$Q^T \left( f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}} \right) = -e^{-\Lambda\tau} Q^T Y_{\text{train}}, \tag{A.5}$$
+
+which implies
+
+$$\begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix} \left( f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}} \right) = -\begin{bmatrix} e^{-\lambda_1 \tau} & & & \\ & e^{-\lambda_2 \tau} & & \\ & & \ddots & \\ & & & e^{-\lambda_N \tau} \end{bmatrix} \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix} Y_{\text{train}}. \tag{A.6}$$
+
+The above equation shows that the convergence rate of $q_i^T (f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}})$ is determined by the $i$-th eigenvalue $\lambda_i$. Moreover, we can decompose the training error into the eigen-space of the NTK as
+
+$$f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}} = \sum_{i=1}^{N} \left\langle f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}}, q_i \right\rangle q_i \tag{A.7}$$
+
+$$= \sum_{i=1}^{N} q_i^T \left( f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}} \right) q_i \tag{A.8}$$
+
+$$= -\sum_{i=1}^{N} \left( e^{-\lambda_i \tau} q_i^T Y_{\text{train}} \right) q_i. \tag{A.9}$$
+
+Clearly, the network is biased to first learn the target function along the eigen-directions of the neural tangent kernel with larger eigenvalues, and then the remaining components corresponding to smaller eigenvalues. Cao et al. [73] provide a detailed analysis of the convergence rate of these components. For conventional fully-connected neural networks, the eigenvalues of the NTK shrink monotonically as the frequency of the corresponding eigenfunctions increases, yielding a significantly lower convergence rate for high frequency components of the target function [52, 74]. This indeed reveals the so-called "spectral bias" [52] pathology of deep neural networks. More importantly, we may conclude that the eigen-space of the neural tangent kernel characterizes the learnability of a target function by a neural network.
+
+Figure 18: Weight factorization transforms loss landscapes and shortens the distance to minima.
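
The eigen-direction argument of Appendix A can be checked numerically on a toy problem. The sketch below integrates the linearized dynamics (A.3) for the residual $r = f_{\theta(\tau)}(X_{\text{train}}) - Y_{\text{train}}$ with a fixed, hand-picked 2×2 kernel and compares the projections onto its eigenvectors with the closed-form decay $e^{-\lambda_i \tau}$. The kernel, the labels, the zero initial network output, and the forward-Euler integrator are all assumptions chosen for illustration; no actual NTK of a neural network is computed.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


def residual_after(K, y, tau, n_steps=20_000):
    """Forward-Euler integration of dr/dtau = -K r, starting from r(0) = -y.

    r(0) = -y corresponds to a network that outputs zero at initialization.
    """
    r = [-yi for yi in y]
    dt = tau / n_steps
    for _ in range(n_steps):
        Kr = [dot(row, r) for row in K]
        r = [ri - dt * kri for ri, kri in zip(r, Kr)]
    return r


# Toy symmetric positive-definite kernel with known eigenpairs (3 and 1).
K = [[2.0, 1.0], [1.0, 2.0]]
inv_sqrt2 = 1.0 / math.sqrt(2.0)
eigenpairs = [(3.0, [inv_sqrt2, inv_sqrt2]), (1.0, [inv_sqrt2, -inv_sqrt2])]
y = [1.0, 0.0]
tau = 1.0

r = residual_after(K, y, tau)
simulated = [dot(q, r) for _, q in eigenpairs]
predicted = [-math.exp(-lam * tau) * dot(q, y) for lam, q in eigenpairs]
```

Both projections start with the same magnitude, but after the same training time the component along the larger eigenvalue has shrunk much further, which is the mechanism behind spectral bias described above.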
+
+# B Random Weight Factorization 
+
+In this section, we provide an intuitive understanding and some theoretical explanations of random weight factorization. For additional numerical validations of RWF, please see [61]. To provide an intuitive understanding, let us consider the simplest setting of a one-parameter loss function $\ell(w)$. In this case, the weight factorization can be simplified to $w = s \cdot v$ with two scalars $s$ and $v$. It is important to note that for any given $w \neq 0$, there exist infinitely many pairs $(s, v)$ such that $w = s \cdot v$. These pairs form a family of hyperbolas in the $sv$-plane, with one for each choice of signs for both $s$ and $v$. Consequently, the loss function in the $sv$-plane remains constant along these hyperbolas. Figure 18 gives a visual illustration of the difference between the original loss landscape as a function of $w$ versus the loss landscape in the factorized $sv$-plane. In the left panel, we plot the original loss function as well as an initial parameter point, the local minimum, and the global minimum. The right panel shows how in the factorized parameter space, each of these three points corresponds to two hyperbolas in the $sv$-plane. Note how the distance between the initialization and the global minima is reduced from the left to the right panel upon an appropriate choice of factorization. The key observation is that the distance between factorizations representing the initial parameter and the global minimum becomes arbitrarily small in the $sv$-plane for larger values of $s$. Indeed, we can prove that this holds for any general loss function in arbitrary parameter dimensions. Further details can be found in [35].
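
The shrinking distance in the $sv$-plane can be made concrete for the scalar case. The sketch below picks, for two weights and a shared scale factor $s$, the factorizations $(s, w/s)$ and measures their Euclidean distance; this gives an upper bound on the distance between the two hyperbolas. The helper names and the example weights are assumptions made for illustration.

```python
import math


def factorize(w, s):
    """Return one point (s, v) on the hyperbola s * v = w."""
    return (s, w / s)


def factorization_gap(w_a, w_b, s):
    """Distance in the sv-plane between the factorizations of w_a and w_b at scale s."""
    (s_a, v_a), (s_b, v_b) = factorize(w_a, s), factorize(w_b, s)
    return math.hypot(s_a - s_b, v_a - v_b)


w_init, w_star = 0.5, -2.0                  # illustrative initial weight and minimizer
gap_in_w = abs(w_init - w_star)             # distance in the original parameter space
gaps = [factorization_gap(w_init, w_star, s) for s in (1.0, 10.0, 1000.0)]
```

Because both points share the same $s$, the gap equals $|w_a - w_b| / |s|$, so it falls below any threshold once $s$ is chosen large enough.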
+
+Theorem B.1. Suppose that $\mathcal{L}(\theta)$ is the associated loss function of a neural network defined in (4.1) and (4.2). For a given $\theta$, we define $U_\theta$ as the set containing all possible weight factorizations
+
+$$U_\theta = \left\{ (s^{(l)}, V^{(l)})_{l=1}^{L+1} : \operatorname{diag}(s^{(l)}) \cdot V^{(l)} = W^{(l)}, \ l = 1, \dots, L+1 \right\}. \tag{B.1}$$
+
+Then for any $\theta, \theta'$, we have
+
+$$\operatorname{dist}(U_\theta, U_{\theta'}) := \min_{x \in U_\theta, \, y \in U_{\theta'}} \|x - y\| = 0. \tag{B.2}$$
+
+Proof. Starting from any fixed network parameters $\theta = \{W^{(l)}, b^{(l)}\}_{l=1}^{L+1}$, we consider the following weight factorization
+
+$$\operatorname{diag}(s^{(l)}) \cdot V^{(l)} = W^{(l)}, \quad l = 1, 2, \dots, L+1. \tag{B.3}$$
+
+Next, consider the set of all possible weight factorizations associated with the initialization $\theta$ as
+
+$$U_\theta = \left\{ (s^{(l)}, V^{(l)})_{l=1}^{L+1} : \operatorname{diag}(s^{(l)}) \cdot V^{(l)} = W^{(l)}, \ l = 1, \dots, L+1 \right\}. \tag{B.4}$$
+
+Let us now define $U_0$ in the factorized parameter space by
+
+$$U_0 = \left\{ (s^{(l)}, 0)_{l=1}^{L+1} : s^{(l)} \in \mathbb{R}^{d_l}, \ l = 1, \dots, L+1 \right\}. \tag{B.5}$$
+
+Since the network parameters $\theta$ are fixed, there exists a constant $C(\theta)$ such that
+
+$$\|V^{(l)}\| \le \frac{\|W^{(l)}\|}{\|s^{(l)}\|} \le \frac{C(\theta)}{\|s^{(l)}\|}, \quad l = 1, \dots, L+1. \tag{B.6}$$
+
+For any weight factorization $(s^{(l)}, V^{(l)})_{l=1}^{L+1}$, we can take $(s^{(l)}, 0)_{l=1}^{L+1} \in U_0$. By the definition of distance between sets, we obtain
+
+$$\operatorname{dist}(U_\theta, U_0) = \min_{x \in U_\theta, \, y \in U_0} \|x - y\| \le \sqrt{\sum_{l=1}^{L+1} \|V^{(l)}\|^2} \le C(\theta) \sqrt{\sum_{l=1}^{L+1} \frac{1}{\|s^{(l)}\|^2}}. \tag{B.7}$$
+
+Therefore, for any network parameters $\theta, \theta'$, taking $C = \max\{C(\theta), C(\theta')\}$ yields
+
+$$\operatorname{dist}(U_\theta, U_{\theta'}) \le \operatorname{dist}(U_\theta, U_0) + \operatorname{dist}(U_0, U_{\theta'}) \le 2C \sqrt{\sum_{l=1}^{L+1} \frac{1}{\|s^{(l)}\|^2}}. \tag{B.8}$$
+
+For $l = 1, \dots, L+1$, letting $s^{(l)}$ go to infinity, we have
+
+$$\operatorname{dist}(U_\theta, U_{\theta'}) = 0. \tag{B.9}$$
+
+As a corollary, let $\theta$ denote a network initialization and $\theta^*$ be a proper local minimum, then there exists a weight factorization with large enough scale factors $s$, such that the distance between $\theta$ and $\theta^*$ can be arbitrarily small in the factorized parameter space.
+
+A different way to examine the effect of the proposed weight factorization is by studying its associated gradient updates. Recall that a standard gradient descent update with a learning rate $\eta$ takes the form
+
+$$w_{n+1}^{(k,l)} = w_n^{(k,l)} - \eta \frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}}. \tag{B.10}$$
+
+The following theorem derives the corresponding gradient descent update expressed in the original parameter space for models using the proposed weight factorization.
+
+Theorem B.2. Under the weight factorization of (4.4), the gradient descent update is given by
+
+$$w_{n+1}^{(k,l)} = w_n^{(k,l)} - \eta \left( [s_n^{(k,l)}]^2 + \|v_n^{(k,l)}\|_2^2 \right) \frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}} + O(\eta^2), \tag{B.11}$$
+
+for $l = 1, 2, \dots, L+1$ and $k = 1, 2, \dots, d_l$.
+
+By comparing (B.10) and (B.11), we observe that the weight factorization $w = s \cdot v$ re-scales the learning rate of $w$ by a factor of $(s^2 + \|v\|_2^2)$. Since $s, v$ are trainable parameters, this analysis suggests that this weight factorization effectively assigns a self-adaptive learning rate to each neuron in the network.
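
The learning-rate re-scaling can be verified by hand in the one-parameter setting $w = s \cdot v$ with scalar $s$ and $v$. The sketch below takes one gradient-descent step on $(s, v)$ for a toy quadratic loss and compares the resulting $w$ with the first-order prediction $w - \eta (s^2 + v^2) \, \partial \mathcal{L} / \partial w$. The loss, the starting values, and the function names are assumptions made for the example, and only the scalar case is covered.

```python
def grad(w, target=3.0):
    """Gradient of the toy loss 0.5 * (w - target)**2 with respect to w."""
    return w - target


def factorized_step(s, v, lr):
    """One gradient-descent step on (s, v) with w = s * v; returns the new w."""
    g = grad(s * v)
    s_new = s - lr * v * g      # chain rule: dL/ds = v * dL/dw
    v_new = v - lr * s * g      # chain rule: dL/dv = s * dL/dw
    return s_new * v_new


def predicted_step(s, v, lr):
    """First-order prediction: w - lr * (s**2 + v**2) * dL/dw."""
    w = s * v
    return w - lr * (s * s + v * v) * grad(w)


s, v, lr = 2.0, 0.5, 1e-3
w_factorized = factorized_step(s, v, lr)     # actual update in the factorized space
w_predicted = predicted_step(s, v, lr)       # re-scaled update in the original space
w_plain = s * v - lr * grad(s * v)           # unfactorized gradient descent, for comparison
```

For these values the factorized and predicted updates differ only by a term of order $\eta^2$, while the plain update moves $w$ noticeably less because its step is not multiplied by $s^2 + v^2 = 4.25$.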
+
+Proof. Suppose that $f^{(k,l)}$ denotes the $k$-th component of $f^{(l)} \in \mathbb{R}^{d_l}$. Under the proposed weight factorization in (4.4), differentiating the loss function $\mathcal{L}$ with respect to $s^{(k,l)}$ and $v^{(k,l)}$, respectively, yields
+
+$$s_{n+1}^{(k,l)} = s_n^{(k,l)} - \eta \frac{\partial \mathcal{L}}{\partial s_n^{(k,l)}} = s_n^{(k,l)} - \eta \frac{\partial \mathcal{L}}{\partial f^{(k,l)}} \cdot v_n^{(k,l)} \cdot g^{(l-1)}, \tag{B.12}$$
+
+$$v_{n+1}^{(k,l)} = v_n^{(k,l)} - \eta \frac{\partial \mathcal{L}}{\partial v_n^{(k,l)}} = v_n^{(k,l)} - \eta s_n^{(k,l)} \frac{\partial \mathcal{L}}{\partial f^{(k,l)}} \cdot g^{(l-1)}. \tag{B.13}$$
+
+Note that
+
+$$\frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}} = \frac{\partial \mathcal{L}}{\partial f^{(k,l)}} \cdot g^{(l-1)}, \tag{B.14}$$
+
+and the update rule of $v^{(k,l)}$ and $s^{(k,l)}$ can be re-written as
+
+$$s_{n+1}^{(k,l)} = s_n^{(k,l)} - \eta v_n^{(k,l)} \cdot \frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}}, \tag{B.15}$$
+
+$$v_{n+1}^{(k,l)} = v_n^{(k,l)} - \eta s_n^{(k,l)} \frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}}. \tag{B.16}$$
+
+Since $w^{(k,l)} = s^{(k,l)} \cdot v^{(k,l)}$, the update rule of $w^{(k,l)}$ is given by
+
+$$w_{n+1}^{(k,l)} = w_n^{(k,l)} - \eta \left( [s_n^{(k,l)}]^2 + \|v_n^{(k,l)}\|_2^2 \right) \frac{\partial \mathcal{L}}{\partial w_n^{(k,l)}} + O(\eta^2). \tag{B.17}$$
+
+# C PINNs can violate causality
+
+To illustrate that PINNs can violate causality, we closely examine the minimization of the PDE residual loss $\mathcal{L}_r$ (see equation 2.8). Before doing so, let us introduce some notation for convenience. Suppose that $0 = t_1 < t_2 < \cdots < t_{N_t} = T$ discretizes the temporal domain, and $\{x_j\}_{j=1}^{N_x}$ discretizes the spatial domain $\Omega$. Now, for a given collection of spatial locations $\{x_j\}_{j=1}^{N_x}$, we can define the temporal residual loss as
+
+$$\mathcal{L}_r(t, \theta) = \frac{1}{N_x} \sum_{j=1}^{N_x} \left| \frac{\partial u_\theta}{\partial t}(t, x_j) + \mathcal{N}[u_\theta](t, x_j) \right|^2. \tag{C.1}$$
+
+For a specified set of parameters $\theta$ and a short time interval $[t^*, t^* + \Delta t]$, the PDE residual loss $\mathcal{L}_r(t^*, \theta)$ essentially measures the deviation from the solution of the corresponding PDE
+
+N [u]( t, x ) = 0 , (t, x ) ∈ [t∗, t ∗ + ∆ t] × Ω (C.2) 
+
+u(t∗, x ) = uθ (t∗, x ), x ∈ Ω (C.3) Here, uθ (t∗, x ) represents the network’s prediction using the given fixed parameters θ evaluated at t∗. As a result, even if Lr (t∗, θ ) = 0 , the accuracy of the predicted solution in [t∗, t ∗ + ∆ t] is determined by the deviation of uθ (t∗, x )
+
+from the ground truth and thus the error will propagate alone the time. Hence, we argue that the temporal residual loss 
+
+Lr (t, θ ) should be based on the current predicted solution at time t and the optimization process is meaningful only if the predicted solution is reasonable for previous times. Note that the residual loss 2.8 can be rewritten as 
+
+Lr (θ) = 1
+
+NtNtX
+
+> i=1
+
+Lr (ti, θ ) (C.4) 
+
+= 1
+
+NtNxNtX
+
+> i=1
+> Nx
+
+X
+
+> j=1
+
+| ∂uθ
+
+∂t (ti, xj ) + N [uθ ]( ti, xj )|2. (C.5) Next, we discretize ∂uθ 
+
+> ∂t
+
+using the forward Euler scheme [ 78 ]. For any 1 ≤ i ≤ Nt − 1, L(ti, θ ) can be approximated by 
+
+Lr (ti, θ ) ≈ 1
+
+NxNxX
+
+> j=1
+
+uθ (ti, xj ) − uθ (ti−1, xj )∆t + N [uθ ]( ti, xj )
+
+> 2
+
+≈ |Ω|
+
+∆t2
+
+Z
+
+> Ω
+
+|uθ (ti, x) − uθ (ti−1, x) + ∆ tN [uθ ]( ti, x)|2dx. (C.6) From the above expression, we immediately obtain that the proper minimization of Lr (ti, θ ) should be based on the correct prediction of both uθ (ti, x) and uθ (ti−1, x), while the original formulation tends to minimize all Lr (ti, θ )
+
+simultaneously. As a result, the residual loss Lr (ti, θ ) will be minimized even if the predictions at ti and previous times are inaccurate. This behavior inevitably violates temporal causality, making the PINN model susceptible to learn erroneous solutions. 0 1 2 3Iteration ×10 5        
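
The per-time residuals $\mathcal{L}_r(t_i, \theta)$ are the quantities that a causal weighting scheme acts on. The sketch below is an illustration rather than the authors' implementation. It assumes the exponential weighting from the causal training paper (Wang, Sankaran and Perdikaris, 2022), $w_i = \exp\left(-\epsilon \sum_{k<i} \mathcal{L}_r(t_k, \theta)\right)$, where $\epsilon$ plays the role of the "causal tolerance" in the hyper-parameter tables. The names `causal_weights` and `weighted_residual_loss` are hypothetical, and the residual values are made-up numbers.

```python
import math


def causal_weights(temporal_losses, tol):
    """Return w_i = exp(-tol * sum of residual losses at earlier times).

    The first weight is always 1, and later weights stay near 0 until
    the residuals at all earlier times have become small.
    """
    weights, running = [], 0.0
    for loss in temporal_losses:
        weights.append(math.exp(-tol * running))
        running += loss
    return weights


def weighted_residual_loss(temporal_losses, tol):
    """Causally weighted mean of the per-time residual losses.

    In an autodiff framework the weights would be treated as constants
    (no gradient flows through them); here they are plain floats.
    """
    weights = causal_weights(temporal_losses, tol)
    total = sum(w * l for w, l in zip(weights, temporal_losses))
    return total / len(temporal_losses)


early = [5.0, 5.0, 5.0, 5.0]      # made-up residuals early in training
late = [1e-4, 1e-4, 1e-4, 1e-4]   # made-up residuals after convergence
print("early: min w =", min(causal_weights(early, tol=1.0)))
print("late:  min w =", min(causal_weights(late, tol=1.0)))
```

With large early-time residuals the minimum weight is close to 0, so later times contribute almost nothing to the loss. Once every residual is small, the minimum weight approaches 1, which is the diagnostic plotted as $\min_t w(t)$ in the figures of this appendix.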
+
+[Figure 19 plot. Left panel: loss ($\mathcal{L}_{ics}$, $\mathcal{L}_{res}$) vs. iteration ($\times 10^5$). Middle panel: weight value ($\lambda_{ics}$, $\lambda_{res}$) vs. iteration. Right panel: $\min_t w(t)$ vs. iteration.]
+
+Figure 19: Allen-Cahn equation: Left: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Mid: Changes of loss weights during training. Right: Minimum value of the temporal residual weights $\min_t w(t)$ during training. All temporal PDE residuals are properly minimized if $\min_t w(t)$ converges to 1.
+
+Table 6: Allen-Cahn equation: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 4 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 2.0 |
+| RWF | μ = 0.5, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 5,000 |
+| Training steps | 300,000 |
+| Batch size | 8,192 |
+| **Weighting Parameters** | |
+| Weighting scheme | NTK |
+| Causal tolerance | 1.0 |
+| Number of chunks | 32 |
+
+Table 7: Advection equation: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 4 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 1.0 |
+| RWF | μ = 1.0, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 2,000 |
+| Training steps | 200,000 |
+| Batch size | 8,192 |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+| Causal tolerance | 1.0 |
+| Number of chunks | 32 |
+
+[Figure 20 plot. Left panel: loss ($\mathcal{L}_{ics}$, $\mathcal{L}_{res}$) vs. iteration ($\times 10^5$). Middle panel: weight value ($\lambda_{ics}$, $\lambda_{res}$) vs. iteration. Right panel: $\min_t w(t)$ vs. iteration.]
+
+Figure 20: Advection equation: Left: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Mid: Changes of loss weights during training. Right: Minimum value of the temporal residual weights $\min_t w(t)$ during training. All temporal PDE residuals are properly minimized if $\min_t w(t)$ converges to 1.
+
+Table 8: Stokes equation: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 4 |
+| Layer size | 256 |
+| Activation | GeLU |
+| Fourier feature scale | 10.0 |
+| RWF | μ = 0.5, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 2,000 |
+| Training steps | 100,000 |
+| Batch size | 8,192 |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+
+[Figure 21 plot. Left panel: loss vs. iteration ($\times 10^5$). Right panel: weight $\lambda$ vs. iteration ($\times 10^5$). Legend: u in, v in, u out, v out, u noslip, v noslip, ru, rv, rc.]
+
+Figure 21: Stokes equation: Left: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Right: Changes of loss weights during training.
+
+Table 9: Kuramoto–Sivashinsky equation: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 5 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 1.0 |
+| RWF | μ = 0.5, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 2,000 |
+| Number of time windows | 10 |
+| Training steps per window | 200,000 |
+| Batch size | 4,096 |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+| Causal tolerance | 1.0 |
+| Number of chunks | 16 |
+
+[Figure 22 plot. Top panel: loss (ics, res) vs. iteration ($\times 10^6$). Middle panel: weight $\lambda$ vs. iteration. Bottom panel: $\min_t w(t)$ vs. iteration.]
+
+Figure 22: Kuramoto–Sivashinsky equation: Top: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Mid: Changes of loss weights during training. Bottom: Changes of causal weights during training.
+
+Table 10: Lid-driven cavity flow: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 5 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 10.0 |
+| RWF | μ = 1.0, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 10,000 |
+| Batch size | 8,192 |
+| **Curriculum Training** | |
+| Re | [100, 400, 1,000, 3,200] |
+| Training steps | [50,000, 50,000, 10,0000, 50,0000] |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+
+| Fourier Feature | RWF | Grad Norm | Modified MLP | Rel. $L^2$ error | Run time (min) |
+| --- | --- | --- | --- | --- | --- |
+| ✓ | ✓ | ✓ | ✓ | $1.34 \times 10^{-1}$ | 58.86 |
+| ✗ | ✓ | ✓ | ✓ | $7.32 \times 10^{-1}$ | 51.28 |
+| ✓ | ✗ | ✓ | ✓ | $1.59 \times 10^{-1}$ | 62.01 |
+| ✓ | ✓ | ✗ | ✓ | $3.38 \times 10^{-1}$ | 57.16 |
+| ✓ | ✓ | ✓ | ✗ | $5.48 \times 10^{-1}$ | 23.40 |
+| ✗ | ✗ | ✗ | ✗ | $7.94 \times 10^{-1}$ | 17.96 |
+
+Table 11: Lid-driven cavity flow: Relative $L^2$ error and run time for an ablation study illustrating the impact of disabling nondimensionalization and individual components of the proposed training pipeline. The first four columns are the ablation settings and the last two report performance. The error is measured against the norm of flow velocity $\|\mathbf{u}\|_2 = \sqrt{u^2 + v^2}$. The first row corresponds to the fine-tuned model using Modified MLP whose hyper-parameter configuration can be found in Table 10.
+
+[Figure 23 plot. Relative $L^2$ error vs. iteration ($\times 10^5$). Legend: Plain, Default, No RWF, No Grad Norm, No Fourier Feature, No Modified MLP.]
+
+Figure 23: Lid-driven cavity: Convergence of relative $L^2$ error for the ablation study with different components disabled.
+
+Table 12: Navier-Stokes flow in torus: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 4 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 1.0 |
+| RWF | μ = 0.5, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 2,000 |
+| Number of time windows | 5 |
+| Training steps per window | 150,000 |
+| Batch size | 8,192 |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+| Causal tolerance | 1.0 |
+| Number of chunks | 16 |
+
+[Figure 24 plot. Top panel: loss vs. iteration ($\times 10^5$); legend: u ic, v ic, w ic, rm, rc. Middle panel: weight $\lambda$ vs. iteration. Bottom panel: $\min_t w(t)$ vs. iteration.]
+
+Figure 24: Navier-Stokes flow in torus: Top: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Mid: Changes of loss weights during training. Bottom: Changes of causal weights during training.
+
+Table 13: Navier-Stokes flow around a cylinder: Hyper-parameter configuration.
+
+| Parameter | Value |
+| --- | --- |
+| **Architecture Parameters** | |
+| Architecture | Modified MLP |
+| Number of layers | 5 |
+| Layer size | 256 |
+| Activation | Tanh |
+| Fourier feature scale | 1.0 |
+| RWF | μ = 1.0, σ = 0.1 |
+| **Training Parameters** | |
+| Learning rate | 0.001 |
+| Decay steps | 2,000 |
+| Number of time windows | 10 |
+| Training steps per window | 200,000 |
+| Batch size | 8,192 |
+| **Weighting Parameters** | |
+| Weighting scheme | Grad Norm |
+| Causal tolerance | 1.0 |
+| Number of chunks | 16 |
+
+[Figure 25 plot. Top panel: loss vs. iteration ($\times 10^6$); legend: u ic, v ic, p ic, u in, v in, u out, v out, u noslip, v noslip, ru, rv, rc. Middle panel: weight $\lambda$ vs. iteration. Bottom panel: $\min_t w(t)$ vs. iteration.]
+
+Figure 25: Navier-Stokes flow around cylinder: Top: Loss convergence of the initial condition loss $\mathcal{L}_{ics}$ and the PDE residual loss $\mathcal{L}_{res}$ during training. Mid: Changes of loss weights during training. Bottom: Changes of causal weights during training.