23.7 Fisher Structure and Deepening of Information-Complexity Inequality
In the previous article we established task-perceived information geometry: configurations are mapped to visible states through families of observation operators, information differences are measured with the Jensen-Shannon distance, and the information dimension was shown to be bounded by the complexity dimension. However, we only touched on the local Fisher structure, without exploring its geometric meaning and practical applications in depth.
This article deepens our understanding of the Fisher information matrix, establishes finer information-complexity inequalities, and shows how these structures can be used to design optimal observation strategies in practical problems.
Core Questions:
- Why is the Fisher matrix the natural metric of “information sensitivity”?
- Under what conditions does the information dimension equal the complexity dimension?
- How should the task $Q$ be chosen to maximize information gain?
This article is based on euler-gls-info/03-discrete-information-geometry.md Sections 4-5 and appendices.
1. Geometric Meaning of Fisher Matrix: Why Is It “Information Sensitivity”?
1.1 Starting from Physical Analogy: Stiffness of Spring
Imagine a spring system:
- You displace the spring near its equilibrium position $x_0$; the displacement is $\Delta x$;
- The potential energy changes by $\Delta E \approx \frac{1}{2} k (\Delta x)^2$, where $k$ is the spring stiffness;
- The larger the stiffness $k$, the larger the potential energy change per unit displacement, and the more “sensitive” the system is to displacement.
The Fisher matrix plays exactly the same role:
- You move from the reference configuration $x_0$ to a nearby configuration; the parameter change is $\Delta\theta$;
- The relative entropy changes by $D_Q(\Delta\theta \,\|\, 0) \approx \frac{1}{2}\, \Delta\theta^\top g^{(Q)} \Delta\theta$;
- The larger the coefficients of the Fisher matrix $g^{(Q)}_{ij}$, the larger the information distance per unit parameter change, and the more “sensitive” the task $Q$ is to changes in that direction.
Core Insight: The Fisher matrix is the “stiffness matrix” or “Hessian” of information geometry; its eigenvalues characterize the information sensitivity in different directions.
1.2 Everyday Analogy: Taste Sensitivity of Wine Taster
Imagine a wine taster tasting wine:
- Reference State $x_0$: a standard red wine;
- Parameter Space: wine attributes $\theta = (\text{acidity}, \text{sweetness}, \text{tannin}, \ldots)$;
- Task $Q$: the taster’s tasting test;
- Visible State $p_x^{(Q)}$: the probability distribution of the taster’s perception of different flavors.
The components of the Fisher matrix $g^{(Q)}_{ij}$ represent:
- $g_{11}$: sensitivity to changes in acidity;
- $g_{22}$: sensitivity to changes in sweetness;
- $g_{12}$: the cross-sensitivity of acidity and sweetness (e.g., sourness affects the perception of sweetness).
Different tasters have different Fisher matrices:
- Novice: all coefficients are small (insensitive to subtle differences);
- Expert: some coefficients are large (e.g., extremely sensitive to tannin).
Core Insight: The Fisher matrix quantifies the observer’s resolving power under a given task.
1.3 Review of Mathematical Definition of Fisher Matrix
Source Theory: euler-gls-info/03-discrete-information-geometry.md Definition 4.1
Definition 1.1 (Local Task Fisher Matrix, from euler-gls-info/03-discrete-information-geometry.md Definition 4.1)
Suppose that near the configuration $x_0$ there exists a local parameterization $\theta \mapsto x(\theta)$ with $x(0) = x_0$, such that the visible state $p(\theta) := p_{x(\theta)}^{(Q)}$ depends smoothly on $\theta$. Define the local Fisher information matrix of task $Q$ as
$$g^{(Q)}_{ij}(x_0) \;=\; \mathbb{E}_{p(0)}\!\left[\left.\frac{\partial \log p(\theta)}{\partial \theta_i}\right|_{\theta=0} \left.\frac{\partial \log p(\theta)}{\partial \theta_j}\right|_{\theta=0}\right].$$
Physical Interpretation:
- $\log p(\theta)$ is the “log-likelihood”; its gradient $\partial_i \log p(\theta)$ is called the “score function”;
- $g^{(Q)}_{ij}$ is the covariance of the score functions, $g^{(Q)}_{ij} = \mathrm{Cov}\!\big(\partial_i \log p,\ \partial_j \log p\big)$, since the score has zero mean;
- A large covariance means high joint sensitivity to the corresponding pair of parameter directions.
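To make Definition 1.1 concrete, here is a minimal numerical sketch (not from the source theory): it builds a hypothetical two-parameter softmax family $p(\theta)$ over a finite visible alphabet, estimates the score functions by finite differences, and averages their outer products under $p(0)$ to obtain $g^{(Q)}$. The matrix `W` and all sizes are invented for the illustration.

```python
import numpy as np

def p(theta, W):
    """Visible-state distribution p(theta) = softmax(W @ theta) over a finite alphabet.
    W is a fixed |Y| x k matrix defining a hypothetical local parameterization."""
    z = W @ theta
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fisher_matrix(theta, W, eps=1e-6):
    """Fisher matrix g_ij = E_{p(theta)}[ d_i log p * d_j log p ], with the scores
    estimated by central finite differences of log p."""
    k = len(theta)
    n_y = W.shape[0]
    scores = np.zeros((k, n_y))                 # scores[i, y] = d log p(y) / d theta_i
    for i in range(k):
        dtheta = np.zeros(k); dtheta[i] = eps
        scores[i] = (np.log(p(theta + dtheta, W)) - np.log(p(theta - dtheta, W))) / (2 * eps)
    weights = p(theta, W)                       # expectation under the reference distribution
    return (scores * weights) @ scores.T        # g_ij = sum_y p(y) s_i(y) s_j(y)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))                     # 5 visible outcomes, 2 parameters
g = fisher_matrix(np.zeros(2), W)
print("Fisher matrix at theta = 0:\n", g)
print("eigenvalues:", np.linalg.eigvalsh(g))
```

The same score-covariance recipe works for any family $p(\theta)$ that can be evaluated pointwise; only the parameterization changes.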
graph TD
A["Reference Configuration x0<br/>Parameter theta=0"] --> B["Local Perturbation<br/>theta → theta+Delta theta"]
B --> C["Visible State Change<br/>p(0) → p(Delta theta)"]
C --> D["Relative Entropy<br/>D_Q(Delta theta||0)"]
D --> E["Second-Order Expansion<br/>D_Q ≈ (1/2) Delta theta^T g Delta theta"]
E --> F["Fisher Matrix g_ij<br/>Information Sensitivity"]
F --> G["Eigenvalues lambda_i<br/>Principal Direction Sensitivity"]
F --> H["Eigenvectors v_i<br/>Principal Sensitive Directions"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffd4e1
style D fill:#ffe1e1
style E fill:#e1ffe1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
2. Spectral Decomposition of Fisher Matrix: Principal Sensitive Directions
Source Theory: Based on euler-gls-info/03-discrete-information-geometry.md Theorem 4.2
2.1 Meaning of Eigenvalues and Eigenvectors
The Fisher matrix $g^{(Q)}$ is symmetric and positive semidefinite, so it admits a spectral decomposition
$$g^{(Q)} \;=\; \sum_{i=1}^{k} \lambda_i\, v_i v_i^\top,$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge 0$ are the eigenvalues and $v_1, \ldots, v_k$ are the corresponding unit eigenvectors.
Geometric Meaning:
- Eigenvector $v_i$: the $i$-th principal sensitive direction, i.e., the direction in which the information change is most significant;
- Eigenvalue $\lambda_i$: the information sensitivity along $v_i$, i.e., the relative entropy increment per unit displacement.
2.2 Everyday Analogy: Camera Field of View
Imagine a camera monitoring a room:
- Reference State $x_0$: an empty room;
- Parameter Space: positions of objects in the room (two-dimensional);
- Task $Q$: the camera’s image recognition;
- Fisher Matrix $g^{(Q)}$: sensitivity to object movement.
If the camera faces the door:
- Principal Direction $v_1$: the horizontal direction (parallel to the door);
- Principal Eigenvalue $\lambda_1$: large (people entering or exiting through the door change the image significantly);
- Secondary Direction $v_2$: the depth direction (perpendicular to the door);
- Secondary Eigenvalue $\lambda_2$: smaller (near-far movement changes the image less significantly).
Core Insight: The spectral decomposition of the Fisher matrix tells us which directions of change are most easily detected by the task $Q$.
2.3 Information Ellipsoid: Geometric Representation of Distinguishability
The second-order approximation of the relative entropy defines an ellipsoid of perturbations that stay below a distinguishability threshold $\epsilon$:
$$E_\epsilon \;=\; \Big\{ \Delta\theta : \tfrac{1}{2}\, \Delta\theta^\top g^{(Q)} \Delta\theta \le \epsilon \Big\}.$$
Geometric features of this ellipsoid:
- Principal axis directions: the eigenvectors $v_i$;
- Principal axis radii: $\sqrt{2\epsilon / \lambda_i}$, i.e., proportional to $1/\sqrt{\lambda_i}$;
- Most sensitive direction: the direction of the largest eigenvalue $\lambda_1$, along which the ellipsoid is narrowest;
- Least sensitive direction: the direction of the smallest eigenvalue $\lambda_k$, along which the ellipsoid is widest.
Everyday Interpretation:
- If you move even a little in the $v_1$ direction, the task $Q$ will detect it (the ellipsoid is narrow there);
- If you move a lot in the $v_k$ direction, it may still go undetected (the ellipsoid is wide there).
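The following small sketch makes Sections 2.1-2.3 concrete: it eigendecomposes a hypothetical $2\times 2$ Fisher matrix and prints, for each principal direction, the information-ellipsoid radius $\sqrt{2\epsilon/\lambda_i}$ at an invented threshold $\epsilon$.

```python
import numpy as np

# Hypothetical Fisher matrix of a task Q at some reference configuration.
g = np.array([[4.0, 1.0],
              [1.0, 0.5]])

eps = 0.01                                    # distinguishability threshold on D_Q
lam, V = np.linalg.eigh(g)                    # eigenvalues in ascending order
order = np.argsort(lam)[::-1]                 # reorder: lambda_1 >= ... >= lambda_k
lam, V = lam[order], V[:, order]

for i, (l, v) in enumerate(zip(lam, V.T), start=1):
    radius = np.sqrt(2 * eps / l)             # semi-axis of the information ellipsoid along v_i
    print(f"lambda_{i} = {l:.3f}, v_{i} = {np.round(v, 3)}, ellipsoid radius = {radius:.4f}")

# The largest eigenvalue gives the smallest radius: the most easily detected direction.
```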
graph LR
A["Fisher Matrix g"] --> B["Spectral Decomposition<br/>g = Sigma lambda_i v_i v_i^T"]
B --> C["Largest Eigenvalue lambda_1<br/>Principal Sensitive Direction v_1"]
B --> D["Smallest Eigenvalue lambda_k<br/>Least Sensitive Direction v_k"]
C --> E["Ellipsoid Narrowest Direction<br/>Easily Detected"]
D --> F["Ellipsoid Widest Direction<br/>Hard to Detect"]
E --> G["Application: Privacy Protection<br/>Avoid Changes in v_1 Direction"]
F --> H["Application: Attack Strategy<br/>Perturb in v_k Direction"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffe1e1
style D fill:#e1ffe1
style E fill:#ffd4e1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
3. Second-Order Expansion of Relative Entropy: Information-Theoretic Version of Cramér-Rao Bound
Source Theory: euler-gls-info/03-discrete-information-geometry.md Theorem 4.2 and Appendix B.1
3.1 Core Theorem
Theorem 3.1 (Fisher Second-Order Form of Relative Entropy, from euler-gls-info/03-discrete-information-geometry.md Theorem 4.2)
Under the local parameterization $\theta \mapsto p(\theta)$ and standard regularity conditions, for sufficiently small $\Delta\theta$ we have
$$D_Q(\Delta\theta \,\|\, 0) \;=\; \tfrac{1}{2}\, \Delta\theta^\top g^{(Q)}(x_0)\, \Delta\theta \;+\; o(\|\Delta\theta\|^2).$$
Everyday Interpretation:
- The theorem says that the relative entropy is locally a quadratic form whose coefficient matrix is the Fisher matrix;
- Physical analogy: the potential energy near an equilibrium point is $V(x) \approx \frac{1}{2}(x - x_0)^\top H (x - x_0)$, where $H$ is the Hessian matrix;
- The Fisher matrix is the “information Hessian” of the relative entropy.
3.2 Proof Strategy (Details in Source Theory Appendix B.1)
Core steps of proof:
- Taylor-expand the relative entropy $D_Q(\Delta\theta \,\|\, 0)$ around $\Delta\theta = 0$;
- Zeroth-order term: $D_Q(0 \,\|\, 0) = 0$;
- First-order term: by the normalization condition $\sum_y p(\theta, y) = 1$, the expected score vanishes, so the first-order term is zero;
- Second-order term: using the second derivatives of the log-likelihood, it simplifies to $\frac{1}{2}\Delta\theta^\top g^{(Q)} \Delta\theta$, where $g^{(Q)}_{ij} = \mathbb{E}\big[\partial_i \log p\;\partial_j \log p\big]$.
Core Technique: Use the identity
$$-\,\mathbb{E}_{p}\big[\partial_i \partial_j \log p\big] \;=\; \mathbb{E}_{p}\big[\partial_i \log p\;\partial_j \log p\big]$$
to convert the mixed second derivatives into the covariance of the score functions.
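A quick numerical check of Theorem 3.1, using the same kind of hypothetical softmax family as in Section 1.3 (all sizes invented): as $\Delta\theta$ shrinks, the exact relative entropy approaches the quadratic form $\frac{1}{2}\Delta\theta^\top g\, \Delta\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))                          # hypothetical 6-outcome, 2-parameter family

def p(theta):
    z = W @ theta
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(q, r):
    return float(np.sum(q * np.log(q / r)))          # relative entropy D(q || r)

# Fisher matrix at theta = 0 via the score covariance (finite-difference scores).
eps = 1e-6
scores = np.array([(np.log(p(e_i * eps)) - np.log(p(-e_i * eps))) / (2 * eps)
                   for e_i in np.eye(2)])
g = (scores * p(np.zeros(2))) @ scores.T

direction = np.array([0.6, -0.8])
for scale in (1e-1, 1e-2, 1e-3):
    dtheta = scale * direction
    exact = kl(p(dtheta), p(np.zeros(2)))            # D_Q(dtheta || 0)
    quad = 0.5 * dtheta @ g @ dtheta                 # (1/2) dtheta^T g dtheta
    print(f"|dtheta| = {scale:.0e}:  D_Q = {exact:.3e},  quadratic form = {quad:.3e}")
```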
3.3 Connection with Cramér-Rao Bound
In statistics, the Cramér-Rao bound states that for any unbiased estimator $\hat\theta$ of the parameter $\theta$, the covariance matrix satisfies
$$\mathrm{Cov}(\hat\theta) \;\succeq\; I(\theta)^{-1},$$
where $I(\theta)$ is the Fisher information matrix and $\succeq$ denotes the positive semidefinite ordering of matrices.
Everyday Interpretation:
- The larger the Fisher matrix, the smaller the variance lower bound, and hence the higher the achievable estimation precision;
- The inverse of the Fisher matrix gives the variance lower bound for the best possible unbiased estimator.
Connection with Relative Entropy:
- The relative entropy characterizes how distinguishable the observation distribution becomes when the true parameter deviates from the reference value;
- The Cramér-Rao bound says how precisely the parameters can be estimated from the observed data;
- The two are unified through the Fisher matrix: it is at once the “Hessian of information” and the quantity whose inverse bounds the estimation variance.
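As a sanity check (a standard textbook example, not part of the source theory): for Bernoulli($p$) observations the per-sample Fisher information is $I(p) = 1/\big(p(1-p)\big)$, and the variance of the sample-mean estimator sits essentially at the Cramér-Rao bound $1/\big(n I(p)\big)$.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n, trials = 0.3, 200, 20000

# Fisher information of a single Bernoulli(p) observation: I(p) = 1 / (p (1 - p)).
fisher_per_obs = 1.0 / (p_true * (1 - p_true))
cramer_rao = 1.0 / (n * fisher_per_obs)            # variance lower bound for unbiased estimators

samples = rng.random((trials, n)) < p_true         # Bernoulli draws
p_hat = samples.mean(axis=1)                       # MLE = sample mean (unbiased)

print(f"Cramér-Rao bound     : {cramer_rao:.6f}")
print(f"empirical Var(p_hat) : {p_hat.var():.6f}")   # ≈ p(1-p)/n, attaining the bound
```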
4. Consistency of Geodesic Distance on Information Manifold and Jensen-Shannon Distance
Source Theory: euler-gls-info/03-discrete-information-geometry.md Theorem 4.5 and Appendix B.2
4.1 Local Consistency Theorem
Theorem 4.1 (Consistency of Local Information Distance, from euler-gls-info/03-discrete-information-geometry.md Theorem 4.5)
Let $x, y \in X$ have visible states $p_x^{(Q)} = p(\theta)$ and $p_y^{(Q)} = p(\theta')$, with $\theta'$ close to $\theta$. Then the discrete information distance $d_{JS,Q}(x, y)$ agrees to leading order with the geodesic distance $d_{S_Q}(\theta, \theta')$ induced by the Fisher metric as $\theta' \to \theta$.
Everyday Interpretation:
- The left-hand side is the Jensen-Shannon information distance on the discrete configuration space;
- The right-hand side is the geodesic distance induced by the Fisher metric on the continuous information manifold;
- The theorem says that, locally, the two are equivalent.
4.2 Why Square Root?
Note the square root in the theorem; it reflects the following chain:
- The relative entropy $D_Q \approx \frac{1}{2}\Delta\theta^\top g\, \Delta\theta$ is quadratic in the perturbation;
- The Jensen-Shannon divergence is likewise quadratic to leading order;
- The Jensen-Shannon distance is the square root of the divergence, which is what makes it satisfy the triangle inequality.
Everyday Analogy: this mirrors the relationship between “squared distance” and “distance” in Euclidean space:
- Squared distance $\|x - y\|^2$: does not satisfy the triangle inequality;
- Distance $\|x - y\|$: satisfies the triangle inequality.
4.3 Riemann Geometry of Information Manifold
The deeper meaning of Theorem 4.1 is this: under Assumption 4.3 (existence of the information manifold $S_Q$), discrete information geometry converges in the continuous limit to a standard Riemann manifold whose metric is the Fisher information metric $g_Q$.
This means we can use the full toolbox of Riemann geometry to study information geometry:
- Geodesics: paths of shortest information distance;
- Curvature: the degree of bending of the information manifold;
- Volume element: the volume formula for information balls;
- Parallel transport: carrying an “information direction” along a path.
graph TD
A["Discrete Configuration Space X"] --> B["Map Phi_Q<br/>X → S_Q"]
B --> C["Continuous Information Manifold<br/>(S_Q, g_Q)"]
A --> D["Discrete Information Distance<br/>d_JS,Q(x,y)"]
C --> E["Fisher Metric<br/>g_Q"]
E --> F["Riemann Distance<br/>d_S_Q(theta,theta')"]
D -.->|"Locally Equivalent<br/>(Theorem 4.1)"| F
C --> G["Riemann Geometry Tools"]
G --> H["Geodesics"]
G --> I["Curvature"]
G --> J["Volume Element"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffd4e1
style D fill:#ffe1e1
style E fill:#e1ffe1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
style I fill:#e1f5ff
style J fill:#fff4e1
5. Strengthened Form of Information-Complexity Inequality
Source Theory: euler-gls-info/03-discrete-information-geometry.md Proposition 3.4, Proposition 5.1 and Appendices A.1, C.1
5.1 Review of Global Lipschitz Inequality
In Article 23.6, we proved a global volume containment relation:
Theorem 5.1 (Information Dimension Constrained by Complexity Dimension, from euler-gls-info/03-discrete-information-geometry.md Proposition 3.4)
Assume there exists a constant $L_Q > 0$ such that $d_{JS,Q}(x, y) \le L_Q\, d_{\mathrm{comp}}(x, y)$ for all adjacent configurations $x, y$. Then
$$\Phi_Q\big(B_R^{\mathrm{comp}}(x_0)\big) \;\subseteq\; B_{L_Q R}^{\mathrm{info},Q}(x_0),$$
and therefore $\dim_{\mathrm{info},Q} \le \dim_{\mathrm{comp}}$.
5.2 Local Lipschitz Inequality
Within the information-manifold framework we have a finer, local version:
Proposition 5.2 (Local Information-Complexity Lipschitz Inequality, from euler-gls-info/03-discrete-information-geometry.md Proposition 5.1)
If there exists a constant $L_Q > 0$ such that for all adjacent configurations $x, y$ we have
$$d_{S_Q}\big(\Phi_Q(x), \Phi_Q(y)\big) \;\le\; L_Q\, d_{\mathrm{comp}}(x, y),$$
then for any path $\gamma = (x_0, x_1, \ldots, x_n)$ we have
$$d_{S_Q}\big(\Phi_Q(x_0), \Phi_Q(x_n)\big) \;\le\; L_Q \sum_{t=0}^{n-1} d_{\mathrm{comp}}(x_t, x_{t+1}).$$
In particular,
$$d_{S_Q}\big(\Phi_Q(x), \Phi_Q(y)\big) \;\le\; L_Q\, d_{\mathrm{comp}}(x, y) \qquad \text{for all } x, y \in X.$$
Everyday Interpretation:
- The Lipschitz constant $L_Q$ characterizes the maximum information gain per unit complexity cost;
- A large $L_Q$ means the task $Q$ has high “information efficiency”;
- A small $L_Q$ means a large complexity cost is needed to obtain even a little information.
5.3 Conditions for Equality: When Does Information Dimension Equal Complexity Dimension?
Theorem 5.1 gives the inequality $\dim_{\mathrm{info},Q} \le \dim_{\mathrm{comp}}$; when does equality hold?
Condition 1: Surjective Map with Bidirectional Lipschitz Bounds
If the map $\Phi_Q : X \to S_Q$ is surjective (each information state corresponds to at least one configuration) and the Lipschitz control holds in both directions,
$$c_1\, d_{\mathrm{comp}}(x, y) \;\le\; d_{S_Q}\big(\Phi_Q(x), \Phi_Q(y)\big) \;\le\; c_2\, d_{\mathrm{comp}}(x, y), \qquad 0 < c_1 \le c_2,$$
then information-ball volumes and complexity-ball volumes grow at the same rate, and therefore $\dim_{\mathrm{info},Q} = \dim_{\mathrm{comp}}$.
Condition 2: The Task $Q$ Is “Complete”
If the task $Q$ contains enough observation operators that different configurations must have different visible states under $Q$ (i.e., $p_x^{(Q)} \neq p_y^{(Q)}$ for all $x \neq y$), then $\Phi_Q$ is injective and the information geometry inherits the full structure of the complexity geometry.
Everyday Analogy:
- Incomplete task: photographing with a low-resolution camera loses many details, so $\dim_{\mathrm{info},Q} < \dim_{\mathrm{comp}}$;
- Complete task: photographing with a high-resolution camera preserves almost all details, so $\dim_{\mathrm{info},Q} = \dim_{\mathrm{comp}}$.
graph TD
A["Complexity Geometry<br/>(X, d_comp)"] -->|"Map Phi_Q"| B["Information Geometry<br/>(S_Q, d_S_Q)"]
A --> C["Complexity Dimension<br/>dim_comp"]
B --> D["Information Dimension<br/>dim_info,Q"]
C --> E["Inequality<br/>dim_info,Q ≤ dim_comp"]
D --> E
E --> F["Equality Conditions"]
F --> G["Condition 1: Bidirectional Lipschitz<br/>c_1 d_comp ≤ d_S_Q ≤ c_2 d_comp"]
F --> H["Condition 2: Complete Task<br/>Phi_Q Injective"]
G --> I["High Information Efficiency<br/>Task Q Almost No Information Loss"]
H --> I
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffe1e1
style D fill:#e1ffe1
style E fill:#ffd4e1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
style I fill:#e1f5ff
6. Curvature and Volume Growth of Information Manifold
6.1 Ricci Curvature on Information Manifold
Although the source theory does not discuss the curvature of the information manifold in detail, we can borrow the ideas on Ricci curvature of complexity geometry from Article 23.5.
For the information manifold $(S_Q, g_Q)$, we can define the standard Riemann curvature tensor and the Ricci curvature $\mathrm{Ric}$.
Physical Meaning:
- Positive curvature: information balls grow more slowly than in Euclidean space, meaning information is highly concentrated;
- Zero curvature: information balls grow at the Euclidean rate; the information manifold is locally flat;
- Negative curvature: information balls grow faster than in Euclidean space, meaning information is highly dispersed.
6.2 Everyday Analogy: Information Density of City
Imagine an “information map” of a city:
- City center: high information density, positive Ricci curvature (spherical geometry); a large amount of information per unit distance;
- Suburbs: low information density, Ricci curvature near zero (planar geometry); less information per unit distance;
- Information desert: almost no information, negative Ricci curvature (hyperbolic geometry); even after walking far, you see nothing new.
6.3 Volume Element of Information Manifold
On the Riemann manifold $(S_Q, g_Q)$, the volume element is given by the determinant of the metric:
$$dV \;=\; \sqrt{\det g_Q(\theta)}\; d\theta.$$
The volume of an information ball of radius $R$ is
$$\mathrm{Vol}\big(B_R^{\mathrm{info},Q}(\theta_0)\big) \;=\; \int_{d_{S_Q}(\theta, \theta_0) \le R} \sqrt{\det g_Q(\theta)}\; d\theta.$$
Comparison with Complexity Geometry:
- Complexity geometry: the metric is induced by the single-step cost;
- Information geometry: the metric is induced by the observation operator family and the task $Q$;
- The two are coupled through the Lipschitz inequality $d_{S_Q} \le L_Q\, d_{\mathrm{comp}}$.
7. Optimal Observation Strategy: How to Choose the Task $Q$?
Source Theory Inspiration: Based on joint action idea from euler-gls-info/03-discrete-information-geometry.md Section 5
7.1 Problem Setting
Assume you have a fixed computation budget $T$ (an upper bound on complexity cost) and want to design an observation task $Q$ that obtains the maximum information within that budget.
Formalization: given the complexity constraint $C(\gamma) \le T$, choose the task $Q$ so that the endpoint information quality $I_Q$ is maximized.
7.2 Greedy Strategy: Maximize Local Information Gain
At each step, choose the observation that maximizes the single-step information gain:
$$j_t \;=\; \arg\max_{j \in J}\; d_{JS,\{j\}}\big(x_t, x_0\big),$$
i.e., choose the observation under which the current configuration $x_t$ has the largest information distance from the reference configuration $x_0$ (a code sketch follows the analogies below).
Everyday Analogy:
- A doctor diagnosing a disease first orders the most discriminating test (e.g., if heart disease is suspected, an ECG before a blood panel);
- A detective first investigates the most suspicious clue.
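The sketch below illustrates this greedy rule under simplifying assumptions (invented data, and each observation $j$ modeled as a likelihood table `O[j, x]` of outcome distributions): at each step it adds the observation whose output best separates the current configuration from the reference in Jensen-Shannon distance.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (natural log).
    Assumes strictly positive distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

rng = np.random.default_rng(3)
n_configs, n_outcomes, n_obs = 8, 5, 6
# O[j, x] = outcome distribution when observation j is applied to configuration x.
O = rng.dirichlet(np.ones(n_outcomes), size=(n_obs, n_configs))

x_ref, x_cur = 0, 5                                 # reference and current configurations
budget, chosen = 3, []

for _ in range(budget):
    remaining = [j for j in range(n_obs) if j not in chosen]
    # Greedy rule: pick the observation that best separates x_cur from x_ref.
    gains = {j: js_distance(O[j, x_cur], O[j, x_ref]) for j in remaining}
    best = max(gains, key=gains.get)
    chosen.append(best)
    print(f"chose observation {best}, single-step information gain {gains[best]:.4f}")

print("greedy task Q =", chosen)
```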
7.3 Optimal Task Selection: Information-Complexity Efficiency
For a task $Q$, define the information-complexity efficiency as
$$\eta(Q) \;=\; \frac{\dim_{\mathrm{info},Q}}{m_Q \cdot \bar{c}_Q},$$
where $m_Q$ is the number of observations in the task and $\bar{c}_Q$ is the average complexity cost per observation.
Interpretation:
- $\dim_{\mathrm{info},Q}$: the information dimension provided by the task (the gain);
- $m_Q \cdot \bar{c}_Q$: the total complexity cost of the task (the cost);
- $\eta(Q)$: the information gain per unit cost (the efficiency).
Optimal Task: $Q^\ast = \arg\max_Q \eta(Q)$.
Everyday Analogy:
- Choosing exam subjects: with limited review time, prioritize the subjects with the best score gain per hour of study;
- Investment decisions: with limited funds, prioritize the projects with the highest rate of return.
7.4 Adaptive Observation: Adjust Task Based on Intermediate Results
A finer strategy is “adaptive observation”: based on the results of the previous observations, dynamically adjust the subsequent observation task.
Algorithm Framework (a code sketch follows this list):
- Initialize: configuration $x_0$, available observation set $J$, empty task $Q_0$, budget $T$;
- For $t = 1, 2, \ldots$ until the budget is exhausted:
- based on the current configuration $x_{t-1}$ and the observation history, choose the next observation $j_t \in J$;
- execute the observation, record its result, and update the configuration to $x_t$;
- update the task $Q_t = Q_{t-1} \cup \{j_t\}$;
- Output: the final information quality $I_{Q_T}(x_T)$.
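Below is one concrete, invented instantiation of this frame: the observer keeps a posterior over configurations, simulates each observation's outcome, and picks the next observation by expected entropy reduction. The posterior bookkeeping and the entropy criterion are illustrative choices, not mandated by the source theory.

```python
import numpy as np

rng = np.random.default_rng(4)
n_configs, n_outcomes, n_obs, budget = 8, 4, 6, 4
O = rng.dirichlet(np.ones(n_outcomes), size=(n_obs, n_configs))   # O[j, x] = outcome distribution
true_x = 2                                                        # hidden true configuration

entropy = lambda p: float(-np.sum(p * np.log(p + 1e-12)))
belief = np.full(n_configs, 1.0 / n_configs)                      # uniform prior over X
task = []

def expected_entropy(j, belief):
    """Expected posterior entropy if observation j were performed next."""
    total = 0.0
    for y in range(n_outcomes):
        pred = float(belief @ O[j, :, y])                         # predictive prob of outcome y
        post = belief * O[j, :, y]
        total += pred * entropy(post / post.sum())
    return total

for t in range(budget):
    j = min(range(n_obs), key=lambda j: expected_entropy(j, belief))   # max expected info gain
    y = rng.choice(n_outcomes, p=O[j, true_x])                    # execute the observation
    belief = belief * O[j, :, y]
    belief = belief / belief.sum()                                # Bayesian update
    task.append(j)
    print(f"step {t+1}: observation {j}, outcome {y}, posterior entropy {entropy(belief):.3f}")

print("adaptive task Q_T =", task, "| most likely configuration:", int(belief.argmax()))
```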
Everyday Analogy:
- Medical diagnosis: based on preliminary test results, decide whether further specialized tests are needed;
- Machine learning: active learning chooses the next sample to label based on model uncertainty.
graph TD
A["Initial Configuration x0<br/>Available Observations J<br/>Budget T"] --> B["Choose Observation j1<br/>max Information Gain"]
B --> C["Execute Observation<br/>Get Result y1"]
C --> D["Update Configuration x1<br/>Update Task Q1"]
D --> E["Choose Observation j2<br/>Based on x1, Q1"]
E --> F["Execute Observation<br/>Get Result y2"]
F --> G["Update Configuration x2<br/>Update Task Q2"]
G --> H["... Continue Until<br/>Budget Exhausted"]
H --> I["Output Final<br/>Information Quality I_QT(xT)"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffd4e1
style D fill:#ffe1e1
style E fill:#e1ffe1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
style I fill:#e1f5ff
8. Application Example: Fisher Information Matrix in Deep Learning
8.1 Fisher Matrix of Neural Network Parameters
Consider a neural network:
- Configuration Space $X$: all possible weight configurations $w$;
- Task $Q$: classification on the test set;
- Visible State: the model’s output probability distribution $p_w(y \mid x)$;
- Fisher Matrix:
$$F_{ij}(w) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim p_w(\cdot \mid x)}\!\left[\frac{\partial \log p_w(y \mid x)}{\partial w_i}\; \frac{\partial \log p_w(y \mid x)}{\partial w_j}\right].$$
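A minimal numpy sketch of this definition for a tiny softmax classifier; the data, the model size, and the choice of sampling $y$ from the model's own predictions (the "true" Fisher rather than the empirical-label variant) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 200, 3, 4                                # samples, input dim, classes
X = rng.normal(size=(n, d))
W = rng.normal(scale=0.1, size=(d, k))             # weights w = the "configuration"

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

P = softmax(X @ W)                                 # p_w(y | x) for every sample

# F = E_x E_{y ~ p_w(.|x)} [ grad log p_w(y|x) grad log p_w(y|x)^T ] over flattened weights,
# using grad_W log p_w(y|x) = x (e_y - p)^T for the softmax model.
F = np.zeros((d * k, d * k))
for x, p in zip(X, P):
    for y in range(k):
        score = np.outer(x, np.eye(k)[y] - p).ravel()
        F += p[y] * np.outer(score, score)
F /= n

lam = np.linalg.eigvalsh(F)
print("Fisher eigenvalues range from", f"{lam.min():.2e}", "to", f"{lam.max():.2e}")
```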
8.2 Applications of Fisher Matrix
Application 1: Natural Gradient Descent
Ordinary gradient descent moves along the gradient direction in parameter space, but “distance” in parameter space is not information distance. Natural gradient descent preconditions the gradient with the inverse of the Fisher matrix, moving along the steepest descent direction of information geometry:
$$w \;\leftarrow\; w - \eta\, F(w)^{-1} \nabla_w L(w).$$
Intuitive Explanation:
- Ordinary gradient: a unit step in parameter space;
- Natural gradient: a unit step in information space, which accounts for the information sensitivity of different parameter directions.
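A self-contained sketch of natural gradient descent on a model where the Fisher matrix is known in closed form: fitting the mean and log standard deviation of a Gaussian, for which $F = \mathrm{diag}(1/\sigma^2,\, 2)$. The data, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=3.0, scale=2.0, size=500)     # hypothetical observations

mu, log_sigma = 0.0, 0.0                            # parameters w = (mu, log sigma)
eta = 0.1

for step in range(100):
    sigma = np.exp(log_sigma)
    # Gradient of the mean negative log-likelihood of N(mu, sigma^2).
    grad = np.array([
        (mu - data.mean()) / sigma**2,
        1.0 - ((data - mu) ** 2).mean() / sigma**2,
    ])
    # Closed-form Fisher matrix for the parameterization (mu, log sigma).
    F = np.diag([1.0 / sigma**2, 2.0])
    mu, log_sigma = np.array([mu, log_sigma]) - eta * np.linalg.solve(F, grad)

print(f"natural-gradient fit: mu = {mu:.3f}, sigma = {np.exp(log_sigma):.3f}")
```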
Application 2: Model Compression
Parameter directions corresponding to small eigenvalues of the Fisher matrix have little impact on the output, so they can be safely pruned or quantized, compressing the model:
- Identify the small-eigenvalue directions: $\{v_i : \lambda_i < \epsilon\}$;
- Round or remove the parameter components along these directions: $w \mapsto w - \sum_{\lambda_i < \epsilon} (v_i^\top w)\, v_i$;
- The information loss is controllable: $D_Q \approx \frac{1}{2}\Delta w^\top F\, \Delta w = \frac{1}{2}\sum_{\lambda_i < \epsilon} \lambda_i (v_i^\top w)^2 \le \frac{\epsilon}{2}\, \|\Delta w\|^2$.
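A toy sketch of this pruning recipe, with a synthetic Fisher matrix and parameter vector (both invented): components along the low-sensitivity eigendirections are projected out, and the resulting second-order information loss is compared with the bound above.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 6
A = rng.normal(size=(d, d))
F = A @ A.T / d                                    # synthetic positive-definite "Fisher" matrix
w = rng.normal(size=d)                             # current parameter vector

lam, V = np.linalg.eigh(F)                         # eigenvalues in ascending order
threshold = np.median(lam)                         # prune the low-sensitivity half
small = lam < threshold

# Remove the components of w along the small-eigenvalue directions.
delta = -(V[:, small] @ (V[:, small].T @ w))
w_pruned = w + delta

info_loss = 0.5 * delta @ F @ delta                # second-order estimate of the information loss
bound = 0.5 * threshold * (delta @ delta)          # (threshold / 2) * ||delta||^2
print(f"pruned {int(small.sum())} of {d} directions")
print(f"estimated information loss {info_loss:.4f} <= bound {bound:.4f}")
```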
Application 3: Uncertainty Estimation
The inverse of the Fisher matrix approximates the posterior covariance of the parameters (Laplace approximation):
$$p(w \mid \mathcal{D}) \;\approx\; \mathcal{N}\big(w_{\mathrm{MLE}},\; F(w_{\mathrm{MLE}})^{-1}\big),$$
where $F$ here denotes the Fisher matrix accumulated over the whole dataset and $w_{\mathrm{MLE}}$ is the maximum likelihood estimate. This can be used for uncertainty quantification in Bayesian neural networks.
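A sketch of the Laplace approximation on the Bernoulli example from Section 3.3 (data and flat-prior assumption invented): the posterior over the parameter is approximated by a Gaussian centred at the MLE with variance $1/\big(n I(\hat p)\big)$, from which a credible interval is read off.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p_true = 150, 0.35
data = rng.random(n) < p_true                       # Bernoulli observations

p_mle = data.mean()                                 # maximum likelihood estimate
fisher_total = n / (p_mle * (1 - p_mle))            # dataset Fisher information: n * I(p_mle)
laplace_std = np.sqrt(1.0 / fisher_total)           # posterior std under the Laplace approximation

# Approximate posterior: N(p_mle, 1 / (n I(p_mle))); sample to get a credible interval.
samples = rng.normal(p_mle, laplace_std, size=10_000)
lo, hi = np.percentile(samples, [2.5, 97.5])
print(f"MLE = {p_mle:.3f}, Laplace 95% credible interval = ({lo:.3f}, {hi:.3f})")
```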
8.3 Everyday Analogy: “Important Directions” in Sculpting
Imagine you’re sculpting a statue:
- Parameter space: all possible sculpting states;
- Fisher matrix: the impact of each direction of change on the statue’s overall appearance;
- Large-eigenvalue directions: key details (e.g., facial contours) that must be carved meticulously;
- Small-eigenvalue directions: minor details (e.g., background texture) that can be handled roughly.
Core Insight: The Fisher matrix tells you which parameters matter for the task and which do not, guiding optimization, compression, regularization, and more.
9. Information-Complexity Joint Optimization: Preview of Variational Principle
Source Theory: euler-gls-info/03-discrete-information-geometry.md Definition 5.2
9.1 Refinement of Joint Action
Recall the joint action from Article 23.6:
$$A_Q(\gamma) \;=\; \alpha\, C(\gamma) \;-\; \beta\, I_Q(\gamma).$$
We can now refine this action using the Fisher structure. On the information manifold, the information quality can be represented as an integral of information distance along the path,
$$I_Q(\gamma) \;=\; \int_0^1 \sqrt{\dot\theta(t)^\top\, g_Q(\theta(t))\, \dot\theta(t)}\; dt,$$
where $\theta(t) = \Phi_Q(\gamma(t))$ is the information path.
Complete form of the joint action:
$$A_Q(\gamma) \;=\; \alpha\, C(\gamma) \;-\; \beta \int_0^1 \sqrt{\dot\theta(t)^\top\, g_Q(\theta(t))\, \dot\theta(t)}\; dt.$$
9.2 Expected Form of Euler-Lagrange Equations
Although the detailed derivation is deferred to Articles 23.10-11, we can expect the optimal path to satisfy an Euler-Lagrange equation of the form
$$\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot\gamma} \;-\; \frac{\partial \mathcal{L}}{\partial \gamma} \;=\; 0,$$
where the Lagrangian density is
$$\mathcal{L}(\gamma, \dot\gamma) \;=\; \alpha\, c(\gamma, \dot\gamma) \;-\; \beta\, \sqrt{\dot\theta^\top g_Q(\theta)\, \dot\theta},$$
with $c(\gamma, \dot\gamma)$ the complexity cost density.
Everyday Interpretation:
- The first term is the complexity cost (the expense);
- The second term is the information gain (the income);
- The Euler-Lagrange equation is the continuous version of “marginal cost = marginal gain”.
9.3 Analogy with Physics: Fermat’s Principle
This variational principle is similar to Fermat’s principle in optics:
- Fermat’s principle: light travels along the path that minimizes propagation time;
- Information-complexity principle: computation proceeds along the path that optimizes the complexity-information trade-off.
The mathematical structures of the two are identical: both are extremum problems for functionals of paths.
10. Complete Picture: Information Geometry from Discrete to Continuous
10.1 Multi-Layer Structure Summary
graph TD
A["Discrete Layer:<br/>Configuration Space X"] --> B["Observation Operators O_j<br/>X → Δ(Y_j)"]
B --> C["Visible States p_x^(Q)"]
C --> D["Relative Entropy D_Q(x||y)"]
D --> E["Jensen-Shannon<br/>Distance d_JS,Q"]
E --> F["Information Ball B_R^info,Q<br/>Information Dimension dim_info,Q"]
C --> G["Local Parameterization<br/>theta ↦ p(theta)"]
G --> H["Fisher Matrix g_ij^(Q)<br/>Second-Order Expansion"]
H --> I["Continuous Layer:<br/>Information Manifold (S_Q, g_Q)"]
I --> J["Riemann Geometry<br/>Geodesics, Curvature, Volume Element"]
F --> K["Inequality<br/>dim_info,Q ≤ dim_comp"]
K --> L["Lipschitz Coupling<br/>d_S_Q ≤ L_Q · d_comp"]
J --> M["Joint Optimization<br/>A_Q = alpha·C - beta·I_Q"]
L --> M
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffd4e1
style D fill:#ffe1e1
style E fill:#e1ffe1
style F fill:#e1fff5
style G fill:#ffe1f5
style H fill:#f5ffe1
style I fill:#e1f5ff
style J fill:#fff4e1
style K fill:#ffd4e1
style L fill:#ffe1e1
style M fill:#e1ffe1
10.2 Core Formula Quick Reference
| Concept | Discrete Version | Continuous Version |
|---|---|---|
| Visible State | $p_x^{(Q)}$ | $p(\theta)$ |
| Relative Entropy | $D_Q(x \Vert y)$ | $\frac{1}{2}\Delta\theta^\top g_Q\, \Delta\theta + o(\lVert\Delta\theta\rVert^2)$ |
| Information Distance | $d_{JS,Q}(x, y)$ | $d_{S_Q}(\theta, \theta')$ |
| Local Metric | Fisher matrix $g^{(Q)}_{ij}$ | Riemann metric $g_Q$ |
| Second-Order Expansion | - | $D_Q \approx \frac{1}{2}\Delta\theta^\top g_Q\, \Delta\theta$ |
| Volume | $\lvert B_R^{\mathrm{info},Q} \rvert$ | $\int \sqrt{\det g_Q}\, d\theta$ |
| Dimension | $\dim_{\mathrm{info},Q}$ | $\dim S_Q$ |
10.3 Comparison with Complexity Geometry
| Dimension | Complexity Geometry | Information Geometry |
|---|---|---|
| Basic Object | Configuration $x \in X$ | Visible state $p_x^{(Q)}$ |
| Basic Distance | $d_{\mathrm{comp}}(x, y)$ | $d_{JS,Q}(x, y)$ |
| Local Metric | Complexity metric | Fisher metric $g_Q$ |
| Curvature | Discrete Ricci curvature | Riemann curvature |
| Volume Growth | $\lvert B_R^{\mathrm{comp}} \rvert$ | $\lvert B_R^{\mathrm{info},Q} \rvert$ |
| Dimension | $\dim_{\mathrm{comp}}$ | $\dim_{\mathrm{info},Q}$ |
| Physical Meaning | “How far you walked” | “What you saw” |
| Dependence | Task-independent | Task-dependent |
11. Summary
This article has deepened our understanding of information geometry. Core points:
11.1 Core Concepts
- Fisher matrix $g^{(Q)}_{ij}$: the Hessian of the relative entropy, characterizing information sensitivity;
- Spectral decomposition $g^{(Q)} = \sum_i \lambda_i v_i v_i^\top$: principal sensitive directions and their sensitivities;
- Second-order expansion of the relative entropy, $D_Q \approx \frac{1}{2}\Delta\theta^\top g^{(Q)} \Delta\theta$: the local quadratic approximation;
- Information manifold $(S_Q, g_Q)$: the continuous limit of discrete information geometry;
- Local consistency $d_{JS,Q} \approx d_{S_Q}$: the bridge between discrete and continuous;
- Lipschitz inequality $d_{S_Q} \le L_Q\, d_{\mathrm{comp}}$: information constrained by complexity;
- Equality conditions: bidirectional Lipschitz bounds or a complete task;
- Optimal observation strategy: maximize the information-complexity efficiency $\eta(Q)$;
- Adaptive observation: dynamically adjust the task based on intermediate results;
- Joint action $A_Q = \alpha C - \beta I_Q$: the foundation of the variational principle.
11.2 Core Insights
- The Fisher matrix is the core of information geometry: it is at once the Hessian of the relative entropy, the foundation of the Cramér-Rao bound, and the local representation of the Riemann metric;
- The spectral decomposition reveals the principal sensitive directions: large eigenvalues correspond to easily detected directions, small eigenvalues to hard-to-detect directions;
- The information-complexity inequality is a resource constraint: $\dim_{\mathrm{info},Q} \le \dim_{\mathrm{comp}}$, so information gain is bounded by computational resources;
- Equality requires an efficient task: a complete task or bidirectional Lipschitz bounds;
- Optimal task selection is an engineering problem: under finite resources, choose the observations with the highest information-complexity efficiency.
11.3 Everyday Analogy Review
- Spring Stiffness: Fisher matrix = information stiffness;
- Wine Taster: Taste sensitivity in different directions;
- Camera Field of View: Principal sensitive direction = principal viewing direction;
- City Information Density: Ricci curvature = information concentration;
- Important Directions in Sculpting: Fisher eigenvalues = parameter importance;
- Investment Efficiency: Information-complexity efficiency = investment return rate.
11.4 Mathematical Structure
Source Theory: All core content in this article is strictly based on euler-gls-info/03-discrete-information-geometry.md Sections 4-5 and Appendices A, B, C.
Key Formulas:
- Fisher matrix: $g^{(Q)}_{ij}(x_0) = \mathbb{E}_{p(0)}\big[\partial_i \log p\;\partial_j \log p\big]$
- Relative entropy expansion: $D_Q(\Delta\theta \,\|\, 0) = \frac{1}{2}\Delta\theta^\top g^{(Q)} \Delta\theta + o(\|\Delta\theta\|^2)$
- Local consistency: $d_{JS,Q}(x, y) \simeq d_{S_Q}(\theta, \theta')$ for nearby configurations
- Lipschitz inequality: $d_{S_Q} \le L_Q\, d_{\mathrm{comp}}$
- Dimension inequality: $\dim_{\mathrm{info},Q} \le \dim_{\mathrm{comp}}$
- Volume containment: $\Phi_Q\big(B_R^{\mathrm{comp}}\big) \subseteq B_{L_Q R}^{\mathrm{info},Q}$
Preview of Next Article: 23.8 Unified Time Scale: Physical Realization of Scattering Master Ruler
In the next article we introduce the unified time scale, the key bridge connecting complexity geometry, information geometry, and physical spacetime:
- Scattering phase derivative: a frequency-dependent “time density”;
- Spectral shift density: the trace of the group delay matrix;
- Continuous limit of the single-step cost: from discrete to continuous;
- Unification of the control manifold and the information manifold: coupled through the unified time scale;
- Gromov-Hausdorff convergence: discrete complexity geometry converges to the continuous control manifold.