<!-- HTML file automatically generated from DocOnce source (https://github.com/doconce/doconce/)
doconce format html NTNU.do.txt --no_mako -->
<!-- dom:TITLE: Artificial intelligence and machine learning in physics  -->

# Artificial intelligence and machine learning in physics 
**Morten Hjorth-Jensen**, Department of Physics and Astronomy and FRIB, Michigan State University, USA, and Department of Physics and Center for Computing in Science Education, University of Oslo, Norway

Date: **Department of Physics, NTNU, March 1, 2024**

## What is this talk about?
The aim is to give a short and pedestrian introduction to why and how machine learning methods can be used in physics, illustrated with several examples,
and to explain why this could (or should) be of interest.

These slides and more at <http://mhjensenseminars.github.io/MachineLearningTalk/doc/pub/NTNU/ipynb/NTNU.ipynb>

## Thanks to many

Jane Kim (MSU), Julie Butler (MSU), Patrick Cook (MSU), Danny Jammooa (MSU), Daniel Bazin (MSU), Dean Lee (MSU), Witek Nazarewicz (MSU), Michelle Kuchera (Davidson College), Even Nordhagen (UiO), Robert Solli (UiO, Expert Analytics), Bryce Fore (ANL), Alessandro Lovato (ANL), Stefano Gandolfi (LANL), Francesco Pederiva (UniTN), and Giuseppe Carleo (EPFL).
Niyaz Beysengulov and Johannes Pollanen (experiment, MSU); Zachary Stewart, Jared Weidman, and Angela Wilson (quantum chemistry, MSU).
Jonas Flaten, Oskar Leinonen, Øyvind Sigmundson Schøyen, Stian Dysthe Bilek, and Håkon Emil Kristiansen (UiO). Marianne Bathen and Lasse Vines (experiments, UiO). Apologies to those I have omitted.

## And sponsors

1. National Science Foundation, US (various grants)

2. Department of Energy, US (various grants)

3. Research Council of Norway (various grants) and my employers University of Oslo and Michigan State University

## AI/ML and some statements you may have heard (and what do they mean?)

1. Fei-Fei Li on ImageNet: **map out the entire world of objects** ([The data that transformed AI research](https://cacm.acm.org/news/219702-the-data-that-transformed-ai-research-and-possibly-the-world/fulltext))

2. Russell and Norvig in their popular textbook: **relevant to any intellectual task; it is truly a universal field** ([Artificial Intelligence, A modern approach](http://aima.cs.berkeley.edu/))

3. Woody Bledsoe puts it more bluntly: **in the long run, AI is the only science** (quoted in Pamela McCorduck, [Machines who think](https://www.pamelamccorduck.com/machines-who-think))

If you wish to have a critical read on AI/ML from a societal point of view, see [Kate Crawford's recent text Atlas of AI](https://www.katecrawford.net/). And you have a local and very popular expert, [Inga Strumke, Machines which think](https://www.goodreads.com/en/book/show/144711751)

**Here: with AI/ML we intend a collection of machine learning methods with an emphasis on statistical learning and data analysis**

## Types of machine learning

The approaches to machine learning are many, but are often split into two main categories. 
In *supervised learning* we know the answer to a problem,
and let the computer deduce the logic behind it. On the other hand, *unsupervised learning*
is a method for finding patterns and relationships in data sets without any prior knowledge of the system.

An important  third category is  *reinforcement learning*. This is a paradigm 
of learning inspired by behavioural psychology, where learning is achieved by trial-and-error, 
solely from rewards and punishment.

## Main categories
Another way to categorize machine learning tasks is to consider the desired output of a system.
Some of the most common tasks are:

  * Classification: Outputs are divided into two or more classes. The goal is to produce a model that assigns each input to one of these classes. An example is identifying digits from pictures of hand-written ones. Classification is typically a supervised learning task.

  * Regression: Finding a functional relationship between an input data set and a reference data set.   The goal is to construct a function that maps input data to continuous output values.

  * Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand.  It is thus a form of unsupervised learning.

## The plethora  of machine learning algorithms/methods

1. Deep learning: Neural Networks (NN), Convolutional NN, Recurrent NN, Boltzmann machines, autoencoders and variational autoencoders  and generative adversarial networks, stable diffusion and many more generative models

2. Bayesian statistics and Bayesian Machine Learning, Bayesian experimental design, Bayesian Regression models, Bayesian neural networks, Gaussian processes and much more

3. Dimensionality reduction (Principal component analysis), Clustering Methods and more

4. Ensemble Methods, Random forests, bagging and voting methods, gradient boosting approaches 

5. Linear and logistic regression, Kernel methods, support vector machines and more

6. Reinforcement Learning; Transfer Learning and more

## What Is Generative Modeling?

Generative modeling can be broadly defined as follows:

Generative modeling is a branch of machine learning that involves
training a model to produce new data that is similar to a given
dataset.

What does this mean in practice? Suppose we have a dataset containing
photos of horses. We can train a generative model on this dataset to
capture the rules that govern the complex relationships between pixels
in images of horses. Then we can sample from this model to create
novel, realistic images of horses that did not exist in the original
dataset.

## Example of generative modeling, [taken from Generative Deep Learning by David Foster](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/ch01.html)

<!-- dom:FIGURE: [figures/generativelearning.png, width=900 frac=1.0] -->
<!-- begin figure -->

<img src="figures/generativelearning.png" width="900"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Generative Modeling

In order to build a generative model, we require a dataset consisting
of many examples of the entity we are trying to generate. This is
known as the training data, and one such data point is called an
observation.

Each observation consists of many features. For an image generation
problem, the features are usually the individual pixel values; for a
text generation problem, the features could be individual words or
groups of letters. It is our goal to build a model that can generate
new sets of features that look as if they have been created using the
same rules as the original data. Conceptually, for image generation
this is an incredibly difficult task, considering the vast number of
ways that individual pixel values can be assigned and the relatively
tiny number of such arrangements that constitute an image of the
entity we are trying to generate.
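As a toy illustration of these ideas (not an example from the texts cited here), the sketch below uses a two-dimensional Gaussian as the generative model. The data set, its size and all variable names are assumptions made for the illustration: we estimate the model parameters from the observations and then sample new observations that were not in the training data.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical training data: 500 observations with two correlated features
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.8], [0.8, 2.0]])
observations = rng.multivariate_normal(true_mean, true_cov, size=500)

# "Training": estimate the parameters of the Gaussian model from the data
model_mean = observations.mean(axis=0)
model_cov = np.cov(observations, rowvar=False)

# "Generation": draw new observations from the fitted model
new_samples = rng.multivariate_normal(model_mean, model_cov, size=5)
print(new_samples)
```

Realistic generative models replace the Gaussian with a far more flexible parametrization, but the two steps (fit a distribution, then sample from it) stay the same.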

## Generative Versus Discriminative Modeling

In order to truly understand what generative modeling aims to achieve
and why this is important, it is useful to compare it to its
counterpart, discriminative modeling. If you have studied machine
learning, most problems you will have faced will have most likely been
discriminative in nature.

## Example of discriminative modeling, [taken from Generative Deep Learning by David Foster](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/ch01.html)

<!-- dom:FIGURE: [figures/standarddeeplearning.png, width=900 frac=1.0] -->
<!-- begin figure -->

<img src="figures/standarddeeplearning.png" width="900"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Discriminative Modeling

When performing discriminative modeling, each observation in the
training data has a label. For a binary classification problem,
our data could be labeled as ones and zeros. Our model then learns how to
discriminate between these two groups and outputs the probability that
a new observation has label 1 or 0.

In contrast, generative modeling doesn’t require the dataset to be
labeled because it concerns itself with generating entirely new
data (for example an image), rather than trying to predict a label for, say, a given image.

## Taxonomy of generative deep learning, [taken from Generative Deep Learning by David Foster](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/ch01.html)

<!-- dom:FIGURE: [figures/generativemodels.png, width=900 frac=1.0] -->
<!-- begin figure -->

<img src="figures/generativemodels.png" width="900"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Good books with hands-on material and codes
* [Sebastian Raschka et al., Machine Learning with PyTorch and Scikit-Learn](https://sebastianraschka.com/blog/2022/ml-pytorch-book.html)

* [David Foster, Generative Deep Learning with TensorFlow](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/ch01.html)

* [Bali and Gavras, Generative AI with Python and TensorFlow 2](https://github.com/PacktPublishing/Hands-On-Generative-AI-with-Python-and-TensorFlow-2)

All three books have GitHub repositories from which one can download all the code. We will borrow most of the material from these three texts as well as
from Goodfellow, Bengio and Courville's text [Deep Learning](https://www.deeplearningbook.org/)

## What are the basic Machine Learning ingredients?
Almost every problem in ML and data science starts with the same ingredients:
* The dataset $\boldsymbol{x}$ (could be some observable quantity of the system we are studying)

* A model which is a function of a set of parameters $\boldsymbol{\alpha}$ that relates to the dataset, say a likelihood  function $p(\boldsymbol{x}\vert \boldsymbol{\alpha})$ or just a simple model $f(\boldsymbol{\alpha})$

* A so-called **loss/cost/risk** function $\mathcal{C} (\boldsymbol{x}, f(\boldsymbol{\alpha}))$ which allows us to decide how well our model represents the dataset. 

We seek the parameter values $\boldsymbol{\alpha}$ which minimize the function $\mathcal{C} (\boldsymbol{x}, f(\boldsymbol{\alpha}))$. This leads to various minimization algorithms. It may surprise many, but at the heart of all machine learning algorithms there is an optimization problem.
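The three ingredients can be put together in a few lines. The sketch below is only an illustration under simple assumptions: the dataset is a set of hypothetical noisy measurements, the model $f(\alpha)=\alpha$ is a single number meant to describe all of them, the cost is the mean squared deviation, and the minimization is plain gradient descent with a fixed learning rate `eta`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=0.5, size=200)   # hypothetical dataset

def model(alpha):
    # simplest possible model f(alpha): one number describing the data
    return alpha

def cost(x, alpha):
    # mean squared deviation between data and model
    return np.mean((x - model(alpha))**2)

def cost_gradient(x, alpha):
    # derivative of the cost with respect to alpha
    return -2.0*np.mean(x - model(alpha))

alpha = 0.0      # initial guess
eta = 0.1        # learning rate
for _ in range(200):
    alpha -= eta*cost_gradient(x, alpha)

print(alpha, x.mean(), cost(x, alpha))
```

For this cost function the minimum is known analytically: the optimal $\alpha$ is the sample mean, which the iteration reproduces.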

## Low-level machine learning, the family of ordinary least squares methods

Our data, to which we want to apply a machine learning method, consist
of a set of inputs $\boldsymbol{x}^T=[x_0,x_1,x_2,\dots,x_{n-1}]$ and the
outputs we want to model $\boldsymbol{y}^T=[y_0,y_1,y_2,\dots,y_{n-1}]$.
We assume that the output data can be represented (for a regression case) by a continuous function $f$
plus a noise term $\boldsymbol{\epsilon}$ through

$$
\boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\epsilon}.
$$

## Setting up the equations

In linear regression we approximate the unknown function with another
continuous function $\tilde{\boldsymbol{y}}(\boldsymbol{x})$ which depends linearly on
some unknown parameters
$\boldsymbol{\theta}^T=[\theta_0,\theta_1,\theta_2,\dots,\theta_{p-1}]$.

The input data can be organized in terms of a so-called design matrix $\boldsymbol{X}\in\mathbb{R}^{n\times p}$,
which gives the approximating function $\boldsymbol{\tilde{y}}$ as

$$
\boldsymbol{\tilde{y}}= \boldsymbol{X}\boldsymbol{\theta}.
$$
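As a concrete illustration, the sketch below builds a design matrix for a polynomial fit, where column $j$ contains $x_i^j$. The choice of a polynomial basis, the number of points and the parameter values are assumptions made for the example; any set of basis functions that enters linearly in $\boldsymbol{\theta}$ would do.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)               # n = 5 input values
p = 3                                      # number of parameters

# Design matrix with columns 1, x, x^2
X = np.vander(x, N=p, increasing=True)

theta = np.array([1.0, 2.0, -0.5])         # hypothetical parameter values
y_tilde = X @ theta                        # model prediction X theta

print(X.shape)
print(y_tilde)
```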

## The objective/cost/loss function

The  simplest approach is the mean squared error

$$
C(\boldsymbol{\theta})=\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)^T\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)\right\},
$$

or, using the design matrix $\boldsymbol{X}$, in a more compact matrix-vector notation as

$$
C(\boldsymbol{\theta})=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)\right\}.
$$

This function represents one of many possible ways to define the so-called cost function.

## Training solution

Optimizing with respect to the unknown parameters $\theta_j$ we get

$$
\boldsymbol{X}^T\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta},
$$

and if the matrix $\boldsymbol{X}^T\boldsymbol{X}$ is invertible we have the optimal values

$$
\hat{\boldsymbol{\theta}} =\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.
$$

We say we 'learn' the unknown parameters $\boldsymbol{\theta}$ from the last equation.
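A minimal numerical sketch of this training step is given below. The data (a quadratic polynomial with weak noise) and the polynomial design matrix are assumptions for the illustration. Instead of forming the inverse explicitly, the code solves the linear system $\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}=\boldsymbol{X}^T\boldsymbol{y}$ and compares with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 1.0, size=n)
# hypothetical data: quadratic polynomial plus weak Gaussian noise
y = 2.0 + 3.0*x - 1.5*x**2 + 0.01*rng.normal(size=n)

# design matrix with columns 1, x, x^2
X = np.vander(x, N=3, increasing=True)

# solve the normal equations X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# reference solution from a least-squares solver
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)
print(theta_ref)
```

For ill-conditioned design matrices a solver based on the singular value decomposition, such as `np.linalg.lstsq`, is usually the safer choice.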

## Ridge and LASSO Regression

Our optimization problem is

$$
{\displaystyle \min_{\boldsymbol{\theta}\in {\mathbb{R}}^{p}}}\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)\right\}.
$$

or we can state it as

$$
{\displaystyle \min_{\boldsymbol{\theta}\in
{\mathbb{R}}^{p}}}\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2,
$$

where we have used the definition of the 2-norm of a vector, that is

$$
\vert\vert \boldsymbol{x}\vert\vert_2 = \sqrt{\sum_i x_i^2}.
$$

## From OLS to Ridge and Lasso

By minimizing the above equation with respect to the parameters
$\boldsymbol{\theta}$ we obtain an analytical expression for the
parameters $\boldsymbol{\theta}$ (the ordinary least squares solution).  We can add a regularization parameter $\lambda$ by
defining a new cost function to be optimized, that is

$$
{\displaystyle \min_{\boldsymbol{\theta}\in
{\mathbb{R}}^{p}}}\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_2^2
$$

which leads to the Ridge regression minimization problem. It is equivalent to minimizing the mean squared error
subject to the constraint $\vert\vert \boldsymbol{\theta}\vert\vert_2^2\le t$, where $t$ is
a finite number larger than zero. We do not include such constraints in the discussion here.
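With the cost function written as above, setting the gradient with respect to $\boldsymbol{\theta}$ to zero gives $\left(\boldsymbol{X}^T\boldsymbol{X}+n\lambda\boldsymbol{I}\right)\boldsymbol{\theta}=\boldsymbol{X}^T\boldsymbol{y}$, where the factor $n$ comes from the $1/n$ in front of the squared error. The sketch below solves this system for hypothetical data (the data, the function name `ridge` and the value of $\lambda$ are assumptions) and shows that the penalty shrinks the norm of the parameters compared with $\lambda=0$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])   # hypothetical parameters
y = X @ theta_true + 0.1*rng.normal(size=n)

def ridge(X, y, lmbda):
    # minimizer of (1/n)||y - X theta||_2^2 + lmbda ||theta||_2^2
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n*lmbda*np.eye(p), X.T @ y)

theta_ols = ridge(X, y, 0.0)      # lambda = 0 recovers ordinary least squares
theta_ridge = ridge(X, y, 0.1)

print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```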

## Lasso regression

Defining

$$
C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_1,
$$

we have a new optimization equation

$$
{\displaystyle \min_{\boldsymbol{\theta}\in
{\mathbb{R}}^{p}}}\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_1
$$

which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator. 
Here we have defined the norm-1 as

$$
\vert\vert \boldsymbol{x}\vert\vert_1 = \sum_i \vert x_i\vert.
$$
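Unlike Ridge regression, the Lasso problem has no closed-form solution in general, since the 1-norm is not differentiable at zero. One common way to solve it is proximal gradient descent (often called ISTA), sketched below as an illustration and not as the method used by any particular library. The data, the function names and the value of $\lambda$ are assumptions. Each iteration takes a gradient step on the squared-error term and then applies a soft-thresholding step, which can set parameters exactly to zero.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t*||.||_1
    return np.sign(v)*np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lmbda, n_iter=2000):
    n, p = X.shape
    # Lipschitz constant of the gradient of (1/n)||y - X theta||_2^2
    L = 2.0*np.linalg.eigvalsh(X.T @ X).max()/n
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = -(2.0/n) * X.T @ (y - X @ theta)
        theta = soft_threshold(theta - grad/L, lmbda/L)
    return theta

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 5))
theta_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])   # hypothetical sparse parameters
y = X @ theta_true + 0.05*rng.normal(size=100)

theta_lasso = lasso_ista(X, y, lmbda=0.1)
print(theta_lasso)
```

The parameters that are zero in `theta_true` are driven to (or very close to) zero, while the nonzero ones are slightly shrunk towards zero.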

## Lots of room for creativity
Not all the
algorithms and methods can be given a rigorous mathematical
justification. This opens up for experimentation
and trial and error, and thereby for exciting new developments.

A solid command of linear algebra, multivariate theory,
probability theory, statistical data analysis, optimization algorithms,
error analysis and Monte Carlo methods is important in order to understand many of the
various algorithms and methods.

**Job market, a personal statement**: [A familiarity with ML is almost becoming a prerequisite for many of the most exciting employment opportunities](https://www.analyticsindiamag.com/top-countries-hiring-most-number-of-artificial-intelligence-machine-learning-experts/). And add quantum computing and there you are!

## Selected references
* [Mehta et al.](https://arxiv.org/abs/1803.08823) and [Physics Reports (2019)](https://www.sciencedirect.com/science/article/pii/S0370157319300766?via%3Dihub).

* [Machine Learning and the Physical Sciences by Carleo et al](https://link.aps.org/doi/10.1103/RevModPhys.91.045002)

* [Artificial Intelligence and Machine Learning in Nuclear Physics, Amber Boehnlein et al., Reviews of Modern Physics 94, 031003 (2022)](https://journals.aps.org/rmp/abstract/10.1103/RevModPhys.94.031003)

* [Dilute neutron star matter from neural-network quantum states by Fore et al, Physical Review Research 5, 033062 (2023)](https://journals.aps.org/prresearch/pdf/10.1103/PhysRevResearch.5.033062)

* [Neural-network quantum states for ultra-cold Fermi gases, Jane Kim et al, Nature Communications Physics, submitted](https://doi.org/10.48550/arXiv.2305.08831)

* [Message-Passing Neural Quantum States for the Homogeneous Electron Gas, Gabriel Pescia, Jane Kim et al. arXiv.2305.07240,](https://doi.org/10.48550/arXiv.2305.07240)

* [Efficient solutions of fermionic systems using artificial neural networks, Nordhagen et al, Frontiers in Physics 11, 2023](https://doi.org/10.3389/fphy.2023.1061580)

* [Particle Data Group summary on ML methods](https://pdg.lbl.gov/2021/reviews/rpp2021-rev-machine-learning.pdf)

## Machine learning. A simple perspective on the interface between ML and Physics

<!-- dom:FIGURE: [figures/mlimage.png, width=800 frac=1.0] -->
<!-- begin figure -->

<img src="figures/mlimage.png" width="800"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## ML in Nuclear  Physics (or any field in physics)

<!-- dom:FIGURE: [figures/ML-NP.png, width=900 frac=1.0] -->
<!-- begin figure -->

<img src="figures/ML-NP.png" width="900"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Scientific Machine Learning

An important and emerging field is what has been dubbed scientific ML, see the article by Deiana et al., [Applications and Techniques for Fast Machine Learning in Science, Frontiers in Big Data **5**, 787421 (2022)](https://doi.org/10.3389/fdata.2022.787421).

The authors discuss applications and techniques for fast machine
learning (ML) in science, that is, the concept of integrating powerful ML
methods into the real-time experimental data processing loop to
accelerate scientific discovery. The report covers three main areas:

1. applications for fast ML across a number of scientific domains;

2. techniques for training and implementing performant and resource-efficient ML algorithms;

3. and computing architectures, platforms, and technologies for deploying these algorithms.

## ML for detectors

<!-- dom:FIGURE: [figures/detectors.png, width=900 frac=1.0] -->
<!-- begin figure -->

<img src="figures/detectors.png" width="900"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Physics driven Machine Learning

Another hot topic is what has loosely been dubbed **Physics-driven deep learning**. See the recent work on [Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nature Machine Intelligence, vol 3, 218 (2021)](https://www.nature.com/articles/s42256-021-00302-5).

**From their abstract.**

A less known but powerful result is that an NN with a single hidden layer can accurately approximate any nonlinear continuous operator. This universal approximation theorem of operators is suggestive of the structure and potential of deep neural networks (DNNs) in learning continuous operators or complex systems from streams of scattered data. ...  We demonstrate that DeepONet can learn various explicit operators, such as integrals and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations.

## And more

* An important application of AI/ML methods is to improve the estimation of bias or uncertainty due to the introduction of or lack of physical constraints in various theoretical models.

* In theory, we expect to use AI/ML algorithms and methods to improve our knowledge about  correlations of physical model parameters in data for quantum many-body systems. Deep learning methods show great promise in circumventing the exploding dimensionalities encountered in quantum mechanical many-body studies. 

* Merging a frequentist approach (the standard path in ML theory) with a Bayesian approach has the potential to infer better probability distributions and error estimates.

* Machine Learning and Quantum Computing is a very interesting avenue to explore. See for example a recent talk by [Sofia Vallecorsa](https://www.youtube.com/watch?v=7WPKv1Q57os&list=PLUPPQ1TVXK7uHwCTccWMBud-zLyvAf8A2&index=5&ab_channel=ECTstar).

## Argon-46 by Solli et al., NIMA 1010, 165461 (2021)

Representations of two events from the
Argon-46 experiment. Each row is one event in two projections,
where the color intensity of each point indicates higher charge values
recorded by the detector. The bottom row illustrates a carbon event with
a large fraction of noise, while the top row shows a proton event
almost free of noise.

<!-- dom:FIGURE: [figures/examples_raw.png, width=500 frac=0.6] -->
<!-- begin figure -->

<img src="figures/examples_raw.png" width="500"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Many-body physics, Quantum Monte Carlo and deep learning
Given a Hamiltonian $H$ and a trial wave function $\Psi_T$, the variational principle states that the expectation value of $H$, defined through

$$
\langle E \rangle =
   \frac{\int d\boldsymbol{R}\Psi^{\ast}_T(\boldsymbol{R})H(\boldsymbol{R})\Psi_T(\boldsymbol{R})}
        {\int d\boldsymbol{R}\Psi^{\ast}_T(\boldsymbol{R})\Psi_T(\boldsymbol{R})},
$$

is an upper bound to the ground state energy $E_0$ of the Hamiltonian $H$, that is

$$
E_0 \le \langle E \rangle.
$$

In general, the integrals involved in the calculation of various expectation values are multi-dimensional ones. Traditional integration methods such as Gauss-Legendre quadrature will not be adequate for, say, the computation of the energy of a many-body system.  **Basic philosophy: Let a neural network find the optimal wave function**

## Quantum Monte Carlo Motivation
**Basic steps.**

Choose a trial wave function
$\psi_T(\boldsymbol{R},\boldsymbol{\alpha})$ and define the probability distribution

$$
P(\boldsymbol{R},\boldsymbol{\alpha})= \frac{\left|\psi_T(\boldsymbol{R},\boldsymbol{\alpha})\right|^2}{\int \left|\psi_T(\boldsymbol{R},\boldsymbol{\alpha})\right|^2d\boldsymbol{R}}.
$$

This is our model, or likelihood/probability distribution function  (PDF). It depends on some variational parameters $\boldsymbol{\alpha}$.
The approximation to the expectation value of the Hamiltonian is now

$$
\langle E[\boldsymbol{\alpha}] \rangle = 
   \frac{\int d\boldsymbol{R}\Psi^{\ast}_T(\boldsymbol{R},\boldsymbol{\alpha})H(\boldsymbol{R})\Psi_T(\boldsymbol{R},\boldsymbol{\alpha})}
        {\int d\boldsymbol{R}\Psi^{\ast}_T(\boldsymbol{R},\boldsymbol{\alpha})\Psi_T(\boldsymbol{R},\boldsymbol{\alpha})}.
$$

## Quantum Monte Carlo Motivation
**Define a new quantity.**

$$
E_L(\boldsymbol{R},\boldsymbol{\alpha})=\frac{1}{\psi_T(\boldsymbol{R},\boldsymbol{\alpha})}H\psi_T(\boldsymbol{R},\boldsymbol{\alpha}),
$$

called the local energy, which, together with our trial PDF yields

$$
\langle E[\boldsymbol{\alpha}] \rangle=\int P(\boldsymbol{R})E_L(\boldsymbol{R},\boldsymbol{\alpha}) d\boldsymbol{R}\approx \frac{1}{N}\sum_{i=1}^NE_L(\boldsymbol{R_i},\boldsymbol{\alpha})
$$

with $N$ being the number of Monte Carlo samples.
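To make these steps concrete, the sketch below applies them to a system chosen here purely for illustration: the one-dimensional harmonic oscillator in units $\hbar=m=\omega=1$ with the trial wave function $\psi_T(x,\alpha)=e^{-\alpha x^2/2}$. For this choice the local energy is $E_L=\alpha/2+x^2(1-\alpha^2)/2$ and the exact expectation value is $\alpha/4+1/(4\alpha)$. The samples are drawn from $P$ with a simple Metropolis algorithm; the step length, the number of samples and the function names are assumptions.

```python
import numpy as np

def local_energy(x, alpha):
    # E_L = (1/psi) H psi for psi = exp(-alpha x^2/2), H = -0.5 d^2/dx^2 + 0.5 x^2
    return 0.5*alpha + 0.5*x**2*(1.0 - alpha**2)

def log_prob(x, alpha):
    # log |psi_T|^2 up to a constant (the normalization cancels in Metropolis)
    return -alpha*x**2

def vmc_energy(alpha, n_samples=50_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    proposals = step*rng.uniform(-1.0, 1.0, size=n_samples)
    log_u = np.log(rng.uniform(size=n_samples))
    x = 0.0
    energies = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + proposals[i]
        # Metropolis test: accept with probability min(1, P(x_new)/P(x))
        if log_u[i] < log_prob(x_new, alpha) - log_prob(x, alpha):
            x = x_new
        energies[i] = local_energy(x, alpha)
    return energies.mean()

for alpha in (0.6, 0.8, 1.0, 1.2):
    # Monte Carlo estimate next to the exact value alpha/4 + 1/(4 alpha)
    print(alpha, vmc_energy(alpha), 0.25*alpha + 0.25/alpha)
```

At $\alpha=1$ the trial function is the exact ground state, the local energy is constant and the statistical error vanishes.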

## Energy derivatives
The expectation value of the local energy as function of the variational parameters now defines our **objective/cost** function.

To find the derivatives of the local energy expectation value as function of the variational parameters, we can use the chain rule and the hermiticity of the Hamiltonian.  

Let us define (with the notation $\langle E[\boldsymbol{\alpha}]\rangle =\langle  E_L\rangle$)

$$
\bar{E}_{\alpha_i}=\frac{d\langle  E_L\rangle}{d\alpha_i},
$$

as the derivative of the energy with respect to the variational parameter $\alpha_i$.
We also define the derivative of the trial function (skipping the subindex $T$) as

$$
\bar{\Psi}_{i}=\frac{d\Psi}{d\alpha_i}.
$$

## Derivatives of the local energy
The elements of the gradient of the energy expectation value are

$$
\bar{E}_{\alpha_i}= 2\left( \left\langle \frac{\bar{\Psi}_{i}}{\Psi}E_L\right\rangle -\left\langle \frac{\bar{\Psi}_{i}}{\Psi}\right\rangle\langle E_L \rangle\right).
$$

From a computational point of view it means that you need to compute the expectation values of

$$
\langle \frac{\bar{\Psi}_{i}}{\Psi}E_L\rangle,
$$

and

$$
\langle \frac{\bar{\Psi}_{i}}{\Psi}\rangle\langle E_L\rangle
$$

These integrals are evaluated using MC integration (with all its possible error sources). We then use methods like stochastic gradient descent or other minimization methods to find the optimal parameters.
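The gradient expression can be tested on a simple system, chosen here as an illustration: the one-dimensional harmonic oscillator ($\hbar=m=\omega=1$) with trial function $\Psi=e^{-\alpha x^2/2}$, for which $\bar{\Psi}/\Psi=-x^2/2$ and $E_L=\alpha/2+x^2(1-\alpha^2)/2$. Since $|\Psi|^2$ is a Gaussian with variance $1/(2\alpha)$, the sketch samples it directly instead of using a Metropolis walk. The learning rate, sample size and starting value are assumptions.

```python
import numpy as np

def local_energy(x, alpha):
    return 0.5*alpha + 0.5*x**2*(1.0 - alpha**2)

def energy_and_gradient(alpha, rng, n_samples=20_000):
    # |Psi|^2 ~ exp(-alpha x^2) is Gaussian with variance 1/(2 alpha)
    x = rng.normal(scale=np.sqrt(0.5/alpha), size=n_samples)
    e_loc = local_energy(x, alpha)
    dlnpsi = -0.5*x**2                    # (dPsi/dalpha)/Psi
    # gradient = 2( <dlnpsi E_L> - <dlnpsi><E_L> )
    grad = 2.0*(np.mean(dlnpsi*e_loc) - np.mean(dlnpsi)*np.mean(e_loc))
    return e_loc.mean(), grad

rng = np.random.default_rng(7)
alpha, eta = 0.5, 0.2                     # initial guess and learning rate
for _ in range(200):
    energy, grad = energy_and_gradient(alpha, rng)
    alpha -= eta*grad                     # plain gradient descent step

print(alpha, energy)
```

The iteration should approach $\alpha=1$ and the ground state energy $1/2$, the known exact result for this system.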

## Why Feed Forward Neural Networks (FFNN)?

According to the *Universal approximation theorem*, a feed-forward
neural network with just a single hidden layer containing a finite
number of neurons can approximate a continuous multidimensional
function to arbitrary accuracy, assuming the activation function for
the hidden layer is a **non-constant, bounded and
monotonically-increasing continuous function**.

## Universal approximation theorem

The universal approximation theorem plays a central role in deep
learning.  [Cybenko (1989)](https://link.springer.com/article/10.1007/BF02551274) showed
the following:

Let $\sigma$ be any continuous sigmoidal function such that

$$
\sigma(z) = \left\{\begin{array}{cc} 1 & z\rightarrow \infty\\ 0 & z \rightarrow -\infty \end{array}\right.
$$

Given a continuous and deterministic function $F(\boldsymbol{x})$ on the unit
cube in $d$-dimensions $F\in [0,1]^d$, $x\in [0,1]^d$ and a parameter
$\epsilon >0$, there is a one-layer (hidden) neural network
$f(\boldsymbol{x};\boldsymbol{\Theta})$ with $\boldsymbol{\Theta}=(\boldsymbol{W},\boldsymbol{b})$ and $\boldsymbol{W}\in
\mathbb{R}^{m\times n}$ and $\boldsymbol{b}\in \mathbb{R}^{n}$, for which

$$
\vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert < \epsilon \hspace{0.1cm} \forall \boldsymbol{x}\in[0,1]^d.
$$

## The approximation theorem in words

**Any continuous function $y=F(\boldsymbol{x})$ supported on the unit cube in
$d$-dimensions can be approximated by a one-layer sigmoidal network to
arbitrary accuracy.**

[Hornik (1991)](https://www.sciencedirect.com/science/article/abs/pii/089360809190009T) extended the theorem to any non-constant, bounded activation function, under the assumption that the expectation value satisfies

$$
\mathbb{E}[\vert F(\boldsymbol{x})\vert^2] =\int_{\boldsymbol{x}\in D} \vert F(\boldsymbol{x})\vert^2p(\boldsymbol{x})d\boldsymbol{x} < \infty.
$$

Then we have

$$
\mathbb{E}[\vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert^2] =\int_{\boldsymbol{x}\in D} \vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert^2p(\boldsymbol{x})d\boldsymbol{x} < \epsilon.
$$

## More on the general approximation theorem

None of the proofs give any insight into the relation between the
number of hidden layers and nodes and the approximation error
$\epsilon$, nor the magnitudes of $\boldsymbol{W}$ and $\boldsymbol{b}$.

Neural networks (NNs) have what we may call a kind of universality no matter what function we want to compute.

It does not mean that an NN can be used to exactly compute any function. Rather, we get an approximation that is as good as we want.

## Class of functions we can approximate

The class of functions that can be approximated are the continuous ones.
If the function $F(\boldsymbol{x})$ is discontinuous, it won't in general be possible to approximate it. However, an NN may still give an approximation even if we fail in some points.
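A small numerical experiment can make the statement plausible. The sketch below is an illustration only and not part of any proof: it approximates a hypothetical target $F(x)=\sin(2\pi x)$ on $[0,1]$ with one hidden layer of sigmoid nodes. As a simplifying assumption, the hidden weights and biases are drawn at random and kept fixed, and only the output layer is fitted by least squares. The mean squared error on the grid is then compared for an increasing number of hidden nodes.

```python
import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 200)
F = np.sin(2.0*np.pi*x)                    # hypothetical target function

# hidden layer with fixed random weights and biases (assumption of this sketch)
n_hidden = 50
W = rng.normal(scale=10.0, size=n_hidden)
b = -W*rng.uniform(0.0, 1.0, size=n_hidden)
hidden = sigmoid(np.outer(x, W) + b)       # shape (200, n_hidden)

def fit_mse(m):
    # use the first m hidden nodes and fit the linear output layer by least squares
    A = np.column_stack([np.ones_like(x), hidden[:, :m]])
    coeffs, *_ = np.linalg.lstsq(A, F, rcond=None)
    return np.mean((A @ coeffs - F)**2)

for m in (2, 10, 50):
    print(m, fit_mse(m))
```

Adding hidden nodes reduces the error, in line with the theorem, but the experiment says nothing about how many nodes a given accuracy requires in general.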

## Illustration of a single perceptron model and an FFNN

<!-- dom:FIGURE: [figures/nns.png, width=600 frac=0.7]  In a) we show a single perceptron model while in b) we display a network with two  hidden layers, an input layer and an output layer. -->
<!-- begin figure -->

<img src="figures/nns.png" width="600"><p style="font-size: 0.9em"><i>Figure 1: In a) we show a single perceptron model while in b) we display a network with two  hidden layers, an input layer and an output layer.</i></p>
<!-- end figure -->

## Our network example, simple perceptron with one input

As a simple example we now define a simple perceptron model with
all quantities given by scalars. We consider only one input variable
$x$ and one target value $y$.  We define an activation function
$\sigma_1$ which takes as input

$$
z_1 = w_1x+b_1,
$$

where $w_1$ is the weight and $b_1$ is the bias. These are the
parameters we want to optimize.  The output $a_1=\sigma_1(z_1)$ is then fed into the
**cost/loss** function, which we here for the sake of simplicity just
define as the squared error

$$
C(x;w_1,b_1)=\frac{1}{2}(a_1-y)^2.
$$

## Optimizing the parameters

In setting up the feed forward and back propagation parts of the
algorithm, we need now the derivative of the various variables we want
to train.

We need

$$
\frac{\partial C}{\partial w_1} \hspace{0.1cm}\mathrm{and}\hspace{0.1cm}\frac{\partial C}{\partial b_1}.
$$

Using the chain rule we find

$$
\frac{\partial C}{\partial w_1}=\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial w_1}=(a_1-y)\sigma_1'x,
$$

and

$$
\frac{\partial C}{\partial b_1}=\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial b_1}=(a_1-y)\sigma_1',
$$

which we later will just define as

$$
\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}=\delta_1.
$$
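A quick way to check such chain-rule expressions is to compare them with finite differences. The sketch below does this for a sigmoid activation, for which $\sigma_1'(z_1)=a_1(1-a_1)$. The choice of activation function and the numerical values of $x$, $y$, $w_1$ and $b_1$ are assumptions made for the test.

```python
import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

def cost(x, y, w1, b1):
    a1 = sigmoid(w1*x + b1)
    return 0.5*(a1 - y)**2

def analytic_gradients(x, y, w1, b1):
    a1 = sigmoid(w1*x + b1)
    delta1 = (a1 - y)*a1*(1.0 - a1)      # (a_1 - y) * sigma_1'(z_1)
    return delta1*x, delta1              # dC/dw_1 and dC/db_1

x, y = 0.7, 0.2                          # hypothetical input and target
w1, b1 = 0.5, -0.3                       # hypothetical parameter values
h = 1e-6                                 # finite-difference step

# central finite differences
num_dw = (cost(x, y, w1 + h, b1) - cost(x, y, w1 - h, b1))/(2*h)
num_db = (cost(x, y, w1, b1 + h) - cost(x, y, w1, b1 - h))/(2*h)

dw, db = analytic_gradients(x, y, w1, b1)
print(dw, num_dw)
print(db, num_db)
```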

## Implementing the simple perceptron model

In the example code here we implement the above equations (with explicit
expressions for the derivatives) with just one input variable $x$ and
one output variable.  The target value $y=2x+1$ is a simple linear
function in $x$. Since this is a regression problem, we define the cost function to be proportional to the least squares error

$$
C(y,w_1,b_1)=\frac{1}{2}(a_1-y)^2,
$$

with $a_1$ the output from the network.

```python
# import necessary packages
import numpy as np

def feed_forward(x):
    # weighted sum of inputs to the output layer
    z_1 = x*output_weights + output_bias
    # Output from output node (one node only)
    # Here the output is equal to the input (identity activation)
    a_1 = z_1
    return a_1

def backpropagation(x, y):
    a_1 = feed_forward(x)
    # derivative of cost function
    derivative_cost = a_1 - y
    # the variable delta in the equations, note that output a_1 = z_1, its derivative wrt z_1 is thus 1
    delta_1 = derivative_cost
    # gradients for the output layer
    output_weights_gradient = delta_1*x
    output_bias_gradient = delta_1
    # The cost function is 0.5*(a_1-y)^2. This gives a measure of the error for each iteration
    return output_weights_gradient, output_bias_gradient

# ensure the same random numbers appear every time
np.random.seed(0)
# Input variable
x = 4.0
# Target values
y = 2*x+1.0

# Defining the neural network
n_inputs = 1
n_outputs = 1
# Initialize the network
# weights and bias in the output layer
output_weights = np.random.randn()
output_bias = np.random.randn()

# implementing a simple gradient descent approach with fixed learning rate
eta = 0.01
for i in range(40):
    # calculate gradients from back propagation
    derivative_w1, derivative_b1 = backpropagation(x, y)
    # update weights and biases
    output_weights -= eta * derivative_w1
    output_bias -= eta * derivative_b1
# our final prediction after training
ytilde = output_weights*x+output_bias
# output from the original run: 4.0022640019432767e-07
print(0.5*((ytilde-y)**2))
```


Running this code gives us an acceptable result after some 40-50 iterations. Note that the result depends on the value of the learning rate.

## Central magic

[Automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation)
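The idea can be illustrated with a few lines of code. The sketch below implements forward-mode automatic differentiation with so-called dual numbers, which carry a value and a derivative through every arithmetic operation. This is a toy version for illustration only: the class `Dual`, the helper `tanh` and the one-parameter model `f` are hypothetical, and deep learning libraries typically rely on reverse-mode automatic differentiation instead.

```python
import math

class Dual:
    """Number carrying a value and its derivative with respect to one variable."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule
        return Dual(self.value*other.value,
                    self.deriv*other.value + self.value*other.deriv)
    __rmul__ = __mul__

def tanh(u):
    # chain rule for the tanh activation
    t = math.tanh(u.value)
    return Dual(t, (1.0 - t*t)*u.deriv)

def f(w):
    # hypothetical one-parameter model with input x = 4 and bias 1
    return tanh(w*4.0 + 1.0)

w = Dual(0.3, 1.0)        # seed: derivative of w with respect to itself is 1
out = f(w)
print(out.value, out.deriv)
```

The derivative is exact to machine precision, with no finite-difference step and no hand-derived expression for the full function.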

## Monte Carlo methods and Neural Networks

[Machine Learning and the Deuteron by Keeble and Rios](https://www.sciencedirect.com/science/article/pii/S0370269320305463?via%3Dihub) and
[Variational Monte Carlo calculations of $A\le 4$ nuclei with an artificial neural-network correlator ansatz by Adams et al.](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.127.022502)

**Adams et al**:

<!-- Equation labels as ordinary links -->
<div id="_auto1"></div>

$$
\begin{equation}
H_{LO} =-\sum_i \frac{{\vec{\nabla}_i^2}}{2m_N}
+\sum_{i<j} {\left(C_1  + C_2\, \vec{\sigma_i}\cdot\vec{\sigma_j}\right)
e^{-r_{ij}^2\Lambda^2 / 4 }}
+D_0 \sum_{i<j<k} \sum_{\text{cyc}}
{e^{-\left(r_{ik}^2+r_{ij}^2\right)\Lambda^2/4}}\,,
\label{_auto1} \tag{1}
\end{equation}
$$

where $m_N$ is the mass of the nucleon, $\vec{\sigma_i}$ is the Pauli
matrix acting on nucleon $i$, and $\sum_{\text{cyc}}$ stands for the
cyclic permutation of $i$, $j$, and $k$. The low-energy constants
$C_1$ and $C_2$ are fit to the deuteron binding energy and to the
neutron-neutron scattering length.

## Deep learning neural networks, [Variational Monte Carlo calculations of $A\le 4$ nuclei with an artificial neural-network correlator ansatz by Adams et al.](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.127.022502)

An appealing feature of the neural network ansatz is that it is more general than the more conventional product of two-
and three-body spin-independent Jastrow functions

<!-- Equation labels as ordinary links -->
<div id="_auto2"></div>

$$
\begin{equation}
|\Psi_V^J \rangle = \prod_{i<j<k} \Big( 1-\sum_{\text{cyc}} u(r_{ij}) u(r_{jk})\Big) \prod_{i<j} f(r_{ij}) | \Phi\rangle\,,
\label{_auto2} \tag{2}
\end{equation}
$$

which is commonly used for nuclear Hamiltonians that do not contain tensor and spin-orbit terms.
The above function is replaced by a four-layer Neural Network.

## [Dilute neutron star matter from neural-network quantum states by Fore et al, Physical Review Research 5, 033062 (2023)](https://journals.aps.org/prresearch/pdf/10.1103/PhysRevResearch.5.033062) at density $\rho=0.04$ fm$^{-3}$

<!-- dom:FIGURE: [figures/nmatter.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/nmatter.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Pairing and Spin-singlet and triplet two-body distribution functions at $\rho=0.01$ fm$^{-3}$
<!-- dom:FIGURE: [figures/01_tbd.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/01_tbd.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Pairing and Spin-singlet and triplet two-body distribution functions at $\rho=0.04$ fm$^{-3}$

<!-- dom:FIGURE: [figures/04_tbd.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/04_tbd.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Pairing and Spin-singlet and triplet two-body distribution functions at $\rho=0.08$ fm$^{-3}$
<!-- dom:FIGURE: [figures/08_tbd.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/08_tbd.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## The electron gas in three dimensions with $N=14$ electrons (Wigner-Seitz radius $r_s=2$ a.u.), [Gabriel Pescia, Jane Kim et al. arXiv:2305.07240](https://doi.org/10.48550/arXiv.2305.07240)

<!-- dom:FIGURE: [figures/elgasnew.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/elgasnew.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## [Efficient solutions of fermionic systems using artificial neural networks, Nordhagen et al, Frontiers in Physics 11, 2023](https://doi.org/10.3389/fphy.2023.1061580)

The Hamiltonian of the quantum dot is given by

$$
\hat{H} = \hat{H}_0 + \hat{V},
$$

where $\hat{H}_0$ is the many-body harmonic oscillator (HO) Hamiltonian and $\hat{V}$ is the
inter-electron Coulomb interaction. In dimensionless units,

$$
\hat{V}= \sum_{i < j}^N \frac{1}{r_{ij}},
$$

with $r_{ij}=\vert\mathbf{r}_i - \mathbf{r}_j\vert$.

For two electrons the Hamiltonian is separable, and the relative motion part ($r_{ij}=r$) reads

$$
\hat{H}_r=-\nabla^2_r + \frac{1}{4}\omega^2r^2+ \frac{1}{r}.
$$

It has analytical solutions in two and three dimensions ([M. Taut 1993 and 1994](https://journals.aps.org/pra/abstract/10.1103/PhysRevA.48.3561)).
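The interaction term $\hat{V}$ can be evaluated for a given configuration with a few lines of NumPy. This is a minimal sketch in the dimensionless units used above, taking $r_{ij}$ to be the Euclidean distance between electrons $i$ and $j$. The function name `coulomb_interaction` and the sample positions are assumptions made for illustration.

```python
import numpy as np

def coulomb_interaction(positions):
    """Sum of 1/r_ij over all pairs i < j (dimensionless units).

    positions: array of shape (N, d), one row per electron.
    """
    positions = np.asarray(positions, dtype=float)
    n_particles = positions.shape[0]
    i, j = np.triu_indices(n_particles, k=1)        # all index pairs with i < j
    r_ij = np.linalg.norm(positions[i] - positions[j], axis=1)
    return np.sum(1.0 / r_ij)

# two electrons a unit distance apart in two dimensions
two_electrons = [[0.5, 0.0], [-0.5, 0.0]]
print(coulomb_interaction(two_electrons))

# three electrons: distances 1, 1 and sqrt(2)
three_electrons = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(coulomb_interaction(three_electrons))
```

In a variational Monte Carlo calculation a function of this kind would be called for every sampled configuration when evaluating the local energy.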

## Generative models: Why Boltzmann machines?

Restricted Boltzmann machines (RBMs) have received a
lot of attention lately.  One of the major reasons is that they can be
stacked layer-wise to build deep neural networks that capture
complicated statistics.

The original RBMs had just one visible layer and a hidden layer, but
recently so-called Gaussian-binary RBMs have gained quite some
popularity in imaging since they are capable of modeling continuous
data that are common to natural images.

Furthermore, they have been used to solve complicated quantum
mechanical many-particle problems or classical statistical physics
problems like the Ising and Potts classes of models.

## The structure of the RBM network

<!-- dom:FIGURE: [figures/RBM.png, width=800 frac=1.0] -->
<!-- begin figure -->

<img src="figures/RBM.png" width="800"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## The network

**The network layers**:
1. A function $\boldsymbol{x}$ that represents the visible layer, a vector of $M$ elements (nodes). This layer represents both what the RBM might be given as training input, and what we want it to be able to reconstruct. This might for example be the pixels of an image, the spin values of the Ising model, or coefficients representing speech.

2. The function $\boldsymbol{h}$ represents the hidden, or latent, layer. A vector of $N$ elements (nodes). Also called "feature detectors".

## Goals

The goal of the hidden layer is to increase the model's expressive
power. We encode complex interactions between visible variables by
introducing additional, hidden variables that interact with visible
degrees of freedom in a simple manner, yet still reproduce the complex
correlations between visible degrees in the data once marginalized
over (integrated out).

**The network parameters, to be optimized/learned**:
1. $\boldsymbol{a}$ represents the visible bias, a vector of same length as $\boldsymbol{x}$.

2. $\boldsymbol{b}$ represents the hidden bias, a vector of same length as $\boldsymbol{h}$.

3. $W$ represents the interaction weights, a matrix of size $M\times N$.

## Joint distribution
The restricted Boltzmann machine is described by a Boltzmann distribution

$$
P_{\mathrm{rbm}}(\boldsymbol{x},\boldsymbol{h}) = \frac{1}{Z} \exp{\left(-E(\boldsymbol{x},\boldsymbol{h})\right)},
$$

where $Z$ is the normalization constant or partition function, defined as

$$
Z = \int \int \exp{\left(-E(\boldsymbol{x},\boldsymbol{h})\right)}\, d\boldsymbol{x}\, d\boldsymbol{h}.
$$

Note the absence of the inverse temperature in these equations.

## Network Elements, the energy function

The function $E(\boldsymbol{x},\boldsymbol{h})$ gives the **energy** of a
configuration (pair of vectors) $(\boldsymbol{x}, \boldsymbol{h})$. The lower
the energy of a configuration, the higher the probability of it. This
function also depends on the parameters $\boldsymbol{a}$, $\boldsymbol{b}$ and
$W$. Thus, when we adjust them during the learning procedure, we are
adjusting the energy function to best fit our problem.

## Defining different types of RBMs (Energy based models)

There are different variants of RBMs, and the differences lie in the types of visible and hidden units we choose as well as in the implementation of the energy function $E(\boldsymbol{x},\boldsymbol{h})$. The connection between the nodes in the two layers is given by the weights $w_{ij}$. 

**Binary-Binary RBM:**

RBMs were first developed using binary units in both the visible and hidden layer. The corresponding energy function is defined as follows:

$$
E(\boldsymbol{x}, \boldsymbol{h}) = - \sum_i^M x_i a_i- \sum_j^N b_j h_j - \sum_{i,j}^{M,N} x_i w_{ij} h_j,
$$

where the binary values taken on by the nodes are most commonly 0 and 1.
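As a concrete illustration of the binary-binary energy and of the joint distribution $P_{\mathrm{rbm}}(\boldsymbol{x},\boldsymbol{h})$, here is a minimal sketch for a very small network. For binary units the integrals in $Z$ become sums, which can be carried out by brute force when $M$ and $N$ are tiny. The sizes, the random parameters and the function names are assumptions chosen for illustration only.

```python
import itertools
import numpy as np

def energy_binary(x, h, a, b, W):
    """E(x, h) = -sum_i a_i x_i - sum_j b_j h_j - sum_ij x_i w_ij h_j."""
    return -(a @ x) - (b @ h) - (x @ W @ h)

def partition_function(a, b, W):
    """Brute-force Z: sum of exp(-E) over all 2^(M+N) binary configurations."""
    M, N = W.shape
    Z = 0.0
    for x in itertools.product([0, 1], repeat=M):
        for h in itertools.product([0, 1], repeat=N):
            Z += np.exp(-energy_binary(np.array(x), np.array(h), a, b, W))
    return Z

rng = np.random.default_rng(2024)
M, N = 3, 2                                   # illustrative sizes only
a = rng.normal(scale=0.1, size=M)             # visible bias
b = rng.normal(scale=0.1, size=N)             # hidden bias
W = rng.normal(scale=0.1, size=(M, N))        # interaction weights

Z = partition_function(a, b, W)
x = np.array([1, 0, 1])
h = np.array([0, 1])
print("E(x,h) =", energy_binary(x, h, a, b, W))
print("P(x,h) =", np.exp(-energy_binary(x, h, a, b, W)) / Z)
```

The brute-force sum scales as $2^{M+N}$ and is only feasible for toy examples. This is the reason why sampling methods are needed for realistic networks.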

## Gaussian binary

**Gaussian-Binary RBM:**

Another variant is the RBM where the visible units are Gaussian while the hidden units remain binary:

$$
E(\boldsymbol{x}, \boldsymbol{h}) = \sum_i^M \frac{(x_i - a_i)^2}{2\sigma_i^2} - \sum_j^N b_j h_j - \sum_{i,j}^{M,N} \frac{x_i w_{ij} h_j}{\sigma_i^2}.
$$
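The Gaussian-binary energy translates directly into code. The sketch below assumes that `sigma` is either a single number or an array with one entry per visible unit. The function name and the numbers in the example are hypothetical.

```python
import numpy as np

def energy_gaussian_binary(x, h, a, b, W, sigma):
    """Gaussian-binary RBM energy with continuous x and binary h."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    visible = np.sum((x - a)**2 / (2.0 * sigma**2))   # Gaussian visible term
    hidden = b @ h                                    # hidden bias term
    coupling = (x / sigma**2) @ W @ h                 # visible-hidden interaction
    return visible - hidden - coupling

# small example: two visible units, one hidden unit
a = np.zeros(2)
b = np.array([0.5])
W = np.ones((2, 1))
print(energy_gaussian_binary([1.0, 2.0], [1], a, b, W, sigma=1.0))
```

With these numbers the three terms are $2.5$, $0.5$ and $3$, so the energy is $2.5 - 0.5 - 3 = -1$.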

## Representing the wave function

The wavefunction should be a probability amplitude depending on
 $\boldsymbol{x}$. The RBM model is given by the joint distribution of
 $\boldsymbol{x}$ and $\boldsymbol{h}$

$$
P_{\mathrm{rbm}}(\boldsymbol{x},\boldsymbol{h}) = \frac{1}{Z} \exp{\left(-E(\boldsymbol{x},\boldsymbol{h})\right)}.
$$

To find the marginal distribution of $\boldsymbol{x}$ we set:

$$
P_{\mathrm{rbm}}(\boldsymbol{x}) =\frac{1}{Z}\sum_{\boldsymbol{h}} \exp{\left(-E(\boldsymbol{x}, \boldsymbol{h})\right)}.
$$

Now this is what we use to represent the wave function, calling it a neural-network quantum state (NQS)

$$
\vert\Psi (\boldsymbol{x})\vert^2 = P_{\mathrm{rbm}}(\boldsymbol{x}).
$$
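For binary hidden units $h_j\in\{0,1\}$ the sum over $\boldsymbol{h}$ factorizes and can be done analytically. With the Gaussian-binary energy one finds, up to the factor $1/Z$,

$$
\sum_{\boldsymbol{h}} \exp{\left(-E(\boldsymbol{x}, \boldsymbol{h})\right)} = \exp{\left(-\sum_i^M \frac{(x_i - a_i)^2}{2\sigma_i^2}\right)} \prod_j^N \left(1+\exp{\left(b_j+\sum_i^M \frac{x_i w_{ij}}{\sigma_i^2}\right)}\right).
$$

The following sketch checks this closed form against an explicit sum over all hidden configurations. The network size and the random parameters are assumptions for illustration, and both functions return the unnormalized marginal, that is $Z\,P_{\mathrm{rbm}}(\boldsymbol{x})$.

```python
import itertools
import numpy as np

def energy(x, h, a, b, W, sigma):
    """Gaussian-binary RBM energy."""
    return (np.sum((x - a)**2 / (2.0 * sigma**2))
            - b @ h - (x / sigma**2) @ W @ h)

def marginal_brute_force(x, a, b, W, sigma):
    """Explicit sum of exp(-E) over all 2^N binary hidden configurations."""
    N = W.shape[1]
    return sum(np.exp(-energy(x, np.array(h, dtype=float), a, b, W, sigma))
               for h in itertools.product([0, 1], repeat=N))

def marginal_closed_form(x, a, b, W, sigma):
    """Same quantity with the hidden units summed out analytically."""
    gaussian = np.exp(-np.sum((x - a)**2 / (2.0 * sigma**2)))
    return gaussian * np.prod(1.0 + np.exp(b + (x / sigma**2) @ W))

rng = np.random.default_rng(1)
M, N = 4, 3                                   # illustrative sizes only
a = rng.normal(size=M)
b = rng.normal(size=N)
W = rng.normal(scale=0.5, size=(M, N))
sigma = 1.0
x = rng.normal(size=M)

print(marginal_brute_force(x, a, b, W, sigma))
print(marginal_closed_form(x, a, b, W, sigma))
```

The closed form costs $O(MN)$ operations instead of $O(2^N)$, which is what makes the RBM practical as a trial wave function.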

## Define the cost function

Now we don't necessarily have training data (unless we generate it by
using some other method). However, what we do have is the variational
principle which allows us to obtain the ground state wave function by
minimizing the expectation value of the energy of a trial wavefunction
(corresponding to the untrained NQS). As in the traditional
variational Monte Carlo method, it is the expectation value of the
local energy that we wish to minimize. The gradient to use for the
stochastic gradient descent procedure is

$$
C_i = \frac{\partial \langle E_L \rangle}{\partial \theta_i}
	= 2(\langle E_L \frac{1}{\Psi}\frac{\partial \Psi}{\partial \theta_i} \rangle - \langle E_L \rangle \langle \frac{1}{\Psi}\frac{\partial \Psi}{\partial \theta_i} \rangle ),
$$

where the local energy is given by

$$
E_L = \frac{1}{\Psi} \hat{\boldsymbol{H}} \Psi.
$$
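In practice the expectation values in the gradient are estimated as averages over Monte Carlo samples. The sketch below assumes that the local energies and the quantities $\frac{1}{\Psi}\frac{\partial \Psi}{\partial \theta_i}$ have already been evaluated for each sample. The function name `energy_gradient` and the small arrays are hypothetical.

```python
import numpy as np

def energy_gradient(local_energies, log_derivatives):
    """Estimate C_i = 2(<E_L O_i> - <E_L><O_i>) from samples.

    local_energies : shape (n_samples,), values of E_L
    log_derivatives: shape (n_samples, n_params), values of
                     O_i = (1/Psi) dPsi/dtheta_i for each sample
    """
    E_L = np.asarray(local_energies, dtype=float)
    O = np.asarray(log_derivatives, dtype=float)
    mean_E = E_L.mean()
    mean_O = O.mean(axis=0)
    mean_EO = (E_L[:, None] * O).mean(axis=0)
    return 2.0 * (mean_EO - mean_E * mean_O)

# two samples, one variational parameter (made-up numbers)
gradient = energy_gradient([1.0, 3.0], [[1.0], [2.0]])
print(gradient)

# one gradient descent step with an assumed learning rate eta
theta = np.array([0.2])
eta = 0.1
theta = theta - eta * gradient
print(theta)
```

Note that the gradient vanishes when the local energy is the same for all samples, which is the case when $\Psi$ is an exact eigenstate of the Hamiltonian.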

## Quantum dots and Boltzmann machines, onebody densities $N=6$, $\hbar\omega=0.1$ a.u.

<!-- dom:FIGURE: [figures/OB6hw01.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/OB6hw01.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Onebody densities $N=30$, $\hbar\omega=1.0$ a.u.
<!-- dom:FIGURE: [figures/OB30hw1.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/OB30hw1.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Onebody densities $N=30$, $\hbar\omega=0.1$ a.u.
<!-- dom:FIGURE: [figures/OB30hw01.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figures/OB30hw01.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Extrapolations and model interpretability

When you hear phrases like **predictions and estimations** and
**correlations and causations**, what do you think of?  Maybe you think
of the difference between classifying new data points and generating
new data points.
Or perhaps you consider that correlations represent some kind of symmetric statements like
if $A$ is correlated with $B$, then $B$ is correlated with
$A$. Causation on the other hand is directional, that is if $A$ causes $B$, $B$ does not
necessarily cause $A$.

## Physics based statistical learning and data analysis

The above concepts are in some sense the difference between **old-fashioned** machine
learning on the one hand and statistics and Bayesian learning on the other. In machine learning and prediction based
tasks, we are often interested in developing algorithms that are
capable of learning patterns from given data in an automated fashion,
and then using these learned patterns to make predictions or
assessments of newly given data. In many cases, our primary concern
is the quality of the predictions or assessments, and we are less
concerned about the underlying patterns that were learned in order
to make these predictions.

Physics based statistical learning points however to approaches that give us both predictions and correlations as well as being able to produce error estimates and understand causations.  This leads us to the very interesting field of Bayesian statistics.

## Bayes' Theorem

Bayes' theorem

$$
p(X\vert Y)= \frac{p(X,Y)}{\sum_{i=0}^{n-1}p(Y\vert X=x_i)p(x_i)}=\frac{p(Y\vert X)p(X)}{\sum_{i=0}^{n-1}p(Y\vert X=x_i)p(x_i)}.
$$

The quantity $p(Y\vert X)$ on the right-hand side of the theorem is
evaluated for the observed data $Y$ and can be viewed as a function of
the parameter space represented by $X$. This function is not
necessarily normalized and is normally called the likelihood function.

The function $p(X)$ on the right-hand side is called the prior, while the function on the left-hand side is called the posterior probability. The denominator on the right-hand side serves as a normalization factor for the posterior distribution.
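A small numerical example may help to fix the roles of prior, likelihood and posterior. The sketch below assumes a parameter $X$ that can take three discrete values $x_0, x_1, x_2$. The prior and likelihood values are made-up numbers used only for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])         # p(x_i), made-up numbers
likelihood = np.array([0.1, 0.4, 0.7])    # p(Y | X = x_i) for the observed Y, made up

evidence = np.sum(likelihood * prior)     # the normalizing denominator
posterior = likelihood * prior / evidence # p(X = x_i | Y)

print("evidence :", evidence)
print("posterior:", posterior)
print("sum      :", posterior.sum())
```

The likelihood values do not sum to one, but the posterior does. Here the observed data shift most of the probability from $x_0$, favored by the prior, to $x_2$.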

## [Quantified limits of the nuclear landscape](https://journals.aps.org/prc/abstract/10.1103/PhysRevC.101.044307)

Predictions made with eleven global mass models and Bayesian model averaging

<!-- dom:FIGURE: [figures/landscape.jpg, width=800 frac=1.0] -->
<!-- begin figure -->

<img src="figures/landscape.jpg" width="800"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Observations (or conclusions if you prefer)
* Need for AI/Machine Learning in physics, lots of ongoing activities

* To solve many complex problems and facilitate discoveries, multidisciplinary efforts are required involving scientists in physics, statistics, computational science, applied math and other fields.

* There is a need for  focused AI/ML learning efforts that will benefit accelerator science and experimental and theoretical programs

## More observations
* How do we develop insights, competences, knowledge in statistical learning that can advance a given field?

  * For example: Can we use ML to find out which correlations are relevant and thereby diminish the dimensionality problem in standard many-body  theories?

  * Can we use AI/ML in detector analysis, accelerator design, analysis of experimental data and more?

  * Can we use AI/ML to carry out reliable extrapolations by using current experimental knowledge and current theoretical models?

* The community needs to invest in relevant educational efforts and training of scientists with knowledge in AI/ML. These are great challenges to the CS and DS communities

* Quantum computing and quantum machine learning not discussed here

* Most likely tons of things I have forgotten

## Possible start to raise awareness about ML in your own field
* Make an ML challenge in your own field a la [Learning to discover: the Higgs boson machine learning challenge](https://home.cern/news/news/computing/higgs-boson-machine-learning-challenge). Alternatively go to kaggle.com at <https://www.kaggle.com/c/higgs-boson>

* HEP@CERN and HEP in general have made significant impacts in the field of machine learning and AI. Something to learn from

## Possible questions for discussions

1. How do we incorporate these topics in our education?

2. More difficult: what are the consequences for universities and our educational mission?

## Education

1. Incorporate elements of statistical data analysis and Machine Learning in undergraduate programs

2. Develop courses on Machine Learning and statistical data analysis

3. Build up a series of courses in Quantum Information Technologies (QIT)

4. Modify the contents of present physics programs or create new programs in Computational Physics and Quantum Technologies

a. study direction/option in **quantum technologies**

b. study direction/option in **Artificial Intelligence and Machine Learning**

c. and more

5. Master of Science/PhD programs in Computational and Data Science

a. UiO already has MSc programs in CS and DS

b. MSU has its own graduate programs plus dual degree programs in CS and DS

c. Many other universities are developing or have similar programs

## Possible quantum courses

**Topics  in a Bachelor of Science/Master of Science.**

1. General university course on quantum mechanics and quantum technologies

2. Information Systems 

3. From Classical Information theory to Quantum Information theory

4. Classical vs. Quantum Logic

5. Classical and Quantum Laboratory 

6. Discipline-Based Quantum Mechanics 

7. Quantum Software

8. Quantum Hardware

9. more

## Important Issues to think of

1. Lots of conceptual learning: superposition, entanglement, QIT applications, etc.

2. Coding is indispensable. 

3. Teamwork, project management, and communication are important and highly valued

4. Engagement with industry: guest lectures, virtual tours, co-ops, and/or internships.

## Observations

1. Students do not really know what QIT is.

2. ML/AI seen as black boxes/magic!

3. Students perceive that a graduate degree is necessary to work in QIS. A BSc will help.

## Future Needs/Problems

1. There is already a great need for specialized people (PhDs, postdocs), but also a need for people with a broad overview of what is possible in ML/AI and/or QIT.

2. There are not enough potential employees in AI/ML and QIT. It is a supply gap, not a skills gap.

3. A BSc with specialization  is a good place to start

4. It is tremendously important to get everyone speaking the same language. Facility with the vernacular of quantum mechanics is a big plus.

5. There is a huge list of areas where technical expertise may be important. But employers are often more concerned with attributes like project management, working well in a team, interest in the field, and adaptability than with specific technical skills.