Introduction to Pygama Math and Fitting#

The goal of this notebook is to illustrate the conventions of the functions stored in pygama.math and how they can be used to fit data. We will also go over how to write new math distributions.

Set up the Python environment#

[1]:

import pygama.math as math

import matplotlib.pyplot as plt
import numpy as np
import os

plt.rcParams["figure.figsize"] = (14, 4)
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.size"] = 12

Statistical Distributions#

All statistical distributions are stored in pygama.math.functions and can be imported through the module pygama.math.distributions. Let’s import a plain-old Gaussian and look at the methods it has associated with it! And then, at the end of this section, we can look at the conventions and how to write a new distribution.

Let’s also fix the shape parameters for the Gaussian.

For comparison’s sake, let’s also import scipy’s version of the Gaussian.

[2]:

from pygama.math.distributions import gaussian
from scipy.stats import norm

mu = 2.5
sigma = 0.7
x = np.linspace(-10, 10, 100)

object_methods = [
    method_name
    for method_name in dir(gaussian)
    if callable(getattr(gaussian, method_name))
]
print(object_methods)

['__call__', '__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '_argcheck', '_argcheck_rvs', '_attach_argparser_methods', '_attach_methods', '_cdf', '_cdf_single', '_cdfvec', '_construct_argparser', '_construct_default_doc', '_construct_doc', '_entropy', '_fit_loc_scale_support', '_fitstart', '_get_support', '_isf', '_logcdf', '_logpdf', '_logpxf', '_logsf', '_mom0_sc', '_mom1_sc', '_mom_integ0', '_mom_integ1', '_moment_error', '_munp', '_nlff_and_penalty', '_nnlf', '_norm_cdf', '_norm_pdf', '_open_support_mask', '_param_info', '_parse_args', '_parse_args_rvs', '_parse_args_stats', '_pdf', '_penalized_nlpsf', '_penalized_nnlf', '_ppf', '_ppf_single', '_ppf_to_solve', '_ppfvec', '_reduce_func', '_rvs', '_sf', '_stats', '_support_mask', '_unpack_loc_scale', '_updated_ctor_param', 'cdf', 'cdf_ext', 'entropy', 'expect', 'fit', 'fit_loc_scale', 'freeze', 'generic_moment', 'get_cdf', 'get_pdf', 'interval', 'isf', 'logcdf', 'logpdf', 'logsf', 'mean', 'median', 'moment', 'nnlf', 'norm_cdf', 'norm_pdf', 'pdf', 'pdf_ext', 'ppf', 'required_args', 'rvs', 'sf', 'stats', 'std', 'support', 'var', 'vecentropy']

Geez, there are a lot of methods. Let’s hone in on a couple: ### .pdf and .cdf

These are just your bread and butter probability density and cumulative density functions

[3]:

gauss_pdf = gaussian.pdf(x, mu, sigma)

plt.plot(x, gauss_pdf, label="Pygama Gaussian PDF", c="k")
plt.plot(x, norm.pdf(x, mu, sigma), label="Scipy Gaussian PDF", ls=":", alpha=1, c="r")
plt.title("PDF of a Gaussian Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_5_0.png

[4]:

gauss_cdf = gaussian.cdf(x, mu, sigma)

plt.plot(x, gauss_cdf, label="Gaussian CDF", c="k")
plt.plot(x, norm.cdf(x, mu, sigma), label="Scipy Gaussian CDF", ls=":", alpha=1, c="r")
plt.title("CDF of a Gaussian Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_6_0.png

Ok, so that’s fine and dandy, what’s so special about a class that reproduces scipy results?

Well, every function in pygama.math.functions subclasses scipy’s rv_continuous class: this allows access to methods that scipy automagically computes. Let’s look at a few

What `rv_continuous` subclassing gets us: `.rvs`, `.logpdf`, `.sf`…#

[5]:

gauss_log_pdf = gaussian.logpdf(x, mu, sigma)

plt.plot(x, gauss_log_pdf, label="Gaussian Log PDF", c="k")
plt.plot(
    x, norm.logpdf(x, mu, sigma), label="Scipy Gaussian Log PDF", ls=":", alpha=1, c="r"
)
plt.title("Log PDF of a Gaussian Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_8_0.png

[6]:

gauss_sf = gaussian.sf(x, mu, sigma)

plt.plot(x, gauss_sf, label="Gaussian SF", c="k")
plt.plot(x, norm.sf(x, mu, sigma), label="Scipy Gaussian SF", ls=":", alpha=1, c="r")
plt.title("Survival Fraction of a Gaussian Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_9_0.png

And now for something really useful, random sampling:

[7]:

random_sample = gaussian.rvs(mu, sigma, size=1000, random_state=1)

hist, bins = np.histogram(random_sample, bins=100, range=(-10, 10))
plt.step(
    bins[1:], hist / np.sum(hist) * np.sum(gauss_pdf), label="Randomly Sampled Data"
)
plt.plot(x, gauss_pdf, label="Gaussian PDF")
plt.title("Comparison of Randomly Sampled Data with Underlying PDF")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_11_0.png

We get access to `scipy`’s `rv_continuous` methods for free#

If we take a look at the actual code for the definition of pygama.math.functions.gaussian we see that there is no direct implementation of methods for .rvs or .logpdf! We get access to all these methods for free. All we have to do is subclass rv_continuous and write definitions for _pdf and _cdf

Now, the drawback: the functions _pdf and _cdf that rv_continuous uses are slow. Let’s see this by comparing to the function that actually computes the pdf: ### But the .pdf and .cdf methods are slow!

[8]:

%%timeit -r 4 -n 10000
gauss_pdf = gaussian.pdf(x, mu, sigma)

117 µs ± 281 ns per loop (mean ± std. dev. of 4 runs, 10,000 loops each)

[9]:

def gaussian_pdf(x, mu, sigma):
    if sigma == 0:
        invs = np.inf
    else:
        invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = invs / np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2) * invnorm

[10]:

%%timeit -r 4 -n 10000
gauss_pdf = gaussian_pdf(x, mu, sigma)

9 µs ± 122 ns per loop (mean ± std. dev. of 4 runs, 10,000 loops each)

Fast methods: `.get_pdf` and `.get_cdf`#

Because the overloaded rv_continuous implementations of pdf and cdf are very slow, we have implemented fast versions of these two methods, named get_pdf and get_cdf.

Each of these fast methods call a numba.njit-wrapped function — a just-in-time compiled function — that runs super fast.

We keep the .pdf and .cdf because they give us access to scipy methods we might want to use, but we write these faster methods so that we can fit distributions quickly.

[11]:

gauss_get_pdf = gaussian.get_pdf(x, mu, sigma)

plt.plot(x, gauss_get_pdf, label="Pygama Gaussian Get_PDF", c="k")
plt.plot(x, norm.pdf(x, mu, sigma), label="Scipy Gaussian PDF", ls=":", alpha=1, c="r")
plt.title("PDF of a Gaussian Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_17_0.png

Ok, that’s good that it reproduces the scipy results numerically. But how fast does it run?

[12]:

%%timeit -r 4 -n 10000
gauss_pdf = gaussian.get_pdf(x, mu, sigma)

2.66 µs ± 50.2 ns per loop (mean ± std. dev. of 4 runs, 10,000 loops each)

See, we’re even faster than the function we defined above! That’s the power of just-in-time compiled code

The remaining methods are helpful for binned and unbinned fitting: `norm_pdf`, `norm_cdf`, `pdf_ext` and `cdf_ext`#

For pygama our default way of fitting distributions is to use the Iminuit package. Iminuit has several different ways to fit binned and unbinned data by performing either extended or unextended fits. These fitting methods require functions with different properties. The way pygama functions interact with Iminuit is as follows:

norm_pdf is used for unbinned fits, which require a pdf that is normalized to 1 on the fitting range
norm_cdf is used for binned fits, which require a cdf that derived from a pdf that is normalized to 1 on the fitting range
pdf_ext is used for extended unbinned fits, and the function is required to return a tuple of the support-normalized pdf integrated over the data window, and the scaled support-normalized pdf.
cdf_ext is used for extended binned fits, and just returns the scaled support-derived cdf

Let’s illustrate some fitting in action, and why some of these weird definitions are required ### Import Iminuit and create data to fit

[13]:

from iminuit import cost, Minuit

xr = (-4, 4)  # xrange
mu = 0.5
sigma = 3

rng = np.random.default_rng(1)

xdata = rng.normal(mu, sigma, size=1000)
xdata = xdata[(xr[0] < xdata) & (xdata < xr[1])]

n, xe = np.histogram(xdata, bins=50, range=xr)
cx = 0.5 * (xe[1:] + xe[:-1])
dx = np.diff(xe)

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.plot(xdata, np.zeros_like(xdata), "|", alpha=0.1);

../_images/notebooks_MathTutorial_22_0.png

Try an unbinned fit using the correct `norm_pdf` and the incorrect `get_pdf`#

We first note that the method norm_pdf has arguments x_lower, x_upper, mu, sigma where x_lower, x_upper are the bounds of the fitting range to ensure that the pdf is normalized to unity on. So, we therefore fix these fit parameters in Iminuit.

We also show a fit with get_pdf to show that the fit returns the incorrect results! We keep the get_pdf methods in pygama because they are useful if we ever need to numerically compute something quickly, or if we do a least squares fit.

[14]:

c = cost.UnbinnedNLL(xdata, gaussian.norm_pdf)

m_norm = Minuit(c, x_lower=xr[0], x_upper=xr[1], mu=0.4, sigma=0.2)
m_norm.fixed["x_lower", "x_upper"] = True
m_norm.limits["mu", "sigma"] = (0, None)
m_norm.migrad()

[14]:

Migrad
FCN = 3418		Nfcn = 108
EDM = 8.56e-07 (Goal: 0.0002)		time = 0.7 sec
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	x_lower	-4.00	0.04					yes
1	x_upper	4.00	0.04					yes
2	mu	0.43	0.17			0
3	sigma	3.10	0.24			0

	x_lower	x_upper	mu	sigma
x_lower	0	0	0	0
x_upper	0	0	0	0
mu	0	0	0.0287	0.0121 (0.300)
sigma	0	0	0.0121 (0.300)	0.0569

[15]:

c = cost.UnbinnedNLL(xdata, gaussian.get_pdf)

m = Minuit(c, mu=0.4, sigma=0.2)
m.limits["mu", "sigma"] = (0, None)
m.migrad()

[15]:

Migrad
FCN = 3570		Nfcn = 90
EDM = 3.38e-06 (Goal: 0.0002)
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	mu	0.19	0.07			0
1	sigma	2.06	0.05			0

	mu	sigma
mu	0.00508	2.25e-06
sigma	2.25e-06	0.00254

[16]:

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(
    xm, gaussian.norm_pdf(xm, *m_norm.values) * len(xdata) * dx[0], label="norm_pdf fit"
)
plt.plot(xm, gaussian.get_pdf(xm, *m.values) * len(xdata) * dx[0], label="get_pdf fit")
plt.plot(
    xm, gaussian.get_pdf(xm, mu, sigma) * len(xdata) * dx[0], label="True Distribution"
)
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_26_0.png

We see that norm_pdf does a much better job of fitting the actual underlying distribution and reproducing the correct \(\mu,\sigma\) parameters

try an extended unbinned fit#

The parameters for the pdf_ext are area, x_lo, x_hi, mu, sigma

[17]:

c = cost.ExtendedUnbinnedNLL(xdata, gaussian.pdf_ext)

m = Minuit(c, area=500, x_lo=xr[0], x_hi=xr[1], mu=0.1, sigma=0.9)
m.fixed["x_lo", "x_hi"] = True
m.limits["area", "mu", "sigma"] = (0, None)
m.migrad()

[17]:

Migrad
FCN = -6134		Nfcn = 125
EDM = 9.54e-06 (Goal: 0.0002)		time = 0.3 sec
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	area	1.04e3	0.06e3			0
1	x_lo	-4.00	0.04					yes
2	x_hi	4.00	0.04					yes
3	mu	0.43	0.17			0
4	sigma	3.10	0.24			0

	area	x_lo	x_hi	mu	sigma
area	3.42e+03	0	0	3.02 (0.305)	10.9 (0.783)
x_lo	0	0	0	0	0
x_hi	0	0	0	0	0
mu	3.02 (0.305)	0	0	0.0287	0.0122 (0.302)
sigma	10.9 (0.783)	0	0	0.0122 (0.302)	0.0569

[18]:

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, gaussian.pdf_ext(xm, *m.values)[1] * dx[0], label="fit")
plt.plot(
    xm, gaussian.pdf_ext(xm, 1000, xr[0], xr[1], mu, sigma)[1] * dx[0], label="actual"
)
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_29_0.png

The pdf_ext function correctly fits the area of the curve!

Binned fits using `norm_cdf` and the incorrect `get_cdf`#

[19]:

c = cost.BinnedNLL(n, xe, gaussian.norm_cdf)

m_norm = Minuit(c, x_lower=xr[0], x_upper=xr[1], mu=0.4, sigma=0.2)
m_norm.fixed["x_lower", "x_upper"] = True
m_norm.limits["mu", "sigma"] = (0.1, None)
m_norm.migrad()

/home/docs/checkouts/readthedocs.org/user_builds/pygama/envs/refactor/lib/python3.10/site-packages/numpy/lib/function_base.py:1447: RuntimeWarning: invalid value encountered in subtract
  a = op(a[slice1], a[slice2])

[19]:

Migrad
FCN = 64.77 (chi2/ndof = 1.3)		Nfcn = 372
EDM = 34.9 (Goal: 0.0002)		time = 0.6 sec
INVALID Minimum		No Parameters at limit
ABOVE EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	x_lower	-4.00	0.04					yes
1	x_upper	4.00	0.04					yes
2	mu	1.4	1.7			0.1
3	sigma	5.5	1.1			0.1

	x_lower	x_upper	mu	sigma
x_lower	0	0	0	0
x_upper	0	0	0	0
mu	0	0	3.83	2.21 (0.998)
sigma	0	0	2.21 (0.998)	1.28

[20]:

c = cost.BinnedNLL(n, xe, gaussian.get_cdf)

m = Minuit(c, mu=0.4, sigma=0.2)
m.limits["mu", "sigma"] = (0.1, None)
m.migrad()

[20]:

Migrad
FCN = 193.7 (chi2/ndof = 4.0)		Nfcn = 123
EDM = 5.4e-07 (Goal: 0.0002)
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	mu	0.19	0.07			0.1
1	sigma	2.05	0.05			0.1

	mu	sigma
mu	0.00506	-5.54e-07
sigma	-5.54e-07	0.00253

[21]:

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.stairs(
    np.diff(gaussian.norm_cdf(xe, *m_norm.values)) * len(xdata),
    xe,
    label="norm_cdf fit",
)
plt.stairs(
    np.diff(gaussian.get_cdf(xe, *m.values)) * len(xdata), xe, label="get_cdf fit"
)
plt.stairs(
    np.diff(gaussian.get_cdf(xe, mu, sigma)) * len(xdata),
    xe,
    label="underlying distribution",
)
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_33_0.png

Finally, extended binned fits with `cdf_ext`#

[22]:

c = cost.ExtendedBinnedNLL(n, xe, gaussian.cdf_ext)

m = Minuit(c, area=500, mu=0.1, sigma=0.9)
m.limits["area", "mu", "sigma"] = (0, None)
m.migrad()

[22]:

Migrad
FCN = 42.85 (chi2/ndof = 0.9)		Nfcn = 128
EDM = 4.11e-06 (Goal: 0.0002)		time = 0.5 sec
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	area	1.04e3	0.06e3			0
1	mu	0.42	0.17			0
2	sigma	3.08	0.23			0

	area	mu	sigma
area	3.3e+03	2.84 (0.297)	10.4 (0.776)
mu	2.84 (0.297)	0.0279	0.0115 (0.295)
sigma	10.4 (0.776)	0.0115 (0.295)	0.0547

[23]:

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.stairs(np.diff(gaussian.cdf_ext(xe, *m.values)), xe, label="fit")
plt.stairs(np.diff(gaussian.cdf_ext(xe, 1000, mu, sigma)), xe, label="underlying")
plt.legend();

../_images/notebooks_MathTutorial_36_0.png

Writing our own `pygama.math` distribution: conventions and requirements#

Let’s write a Cauchy distribution. Recall that a Cauchy distribution has support over the real line, and has a pdf like

\(pdf(x, \mu,\sigma) = \frac{1}{\pi\sigma\left[1+\left(\frac{x-\mu}{\sigma}\right)^2\right]}\)

and a cdf:

\(cdf(x, \mu,\sigma) = \frac{1}{\pi}\arctan\left(\frac{x-\mu}{\sigma}\right)+\frac{1}{2}\)

What we need to define a distribution class#

We need four functions that our class methods will call and their arguments, and they all should be numbafied

A PDF, normalized to the support. Args: x, mu, sigma
A CDF, derived from the support normalized PDF. Args: x, mu, sigma
A scaled PDF. Args: x, area, mu, sigma
A scaled CDF. Args: x, area, mu, sigma

The convention is to name these functions nb_distribution_pdf and nb_distribution_scaled_cdf for example.

[24]:

import numba as nb

# here we set some parameters to ensure that numba will compute things as quickly as possible
kwd_parallel = {"parallel": True, "fastmath": True}
kwd = {"parallel": False, "fastmath": True}

# define the pdf
@nb.njit(**kwd_parallel)
def nb_cauchy_pdf(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    r"""
    Parameters
    ----------
    x
        The input data
    lamb
        The rate
    mu
        The amount to shift the distribution
    sigma
        The amount to scale the distribution
    """

    y = np.empty_like(x, dtype=np.float64)
    # we want to do a loop because it is faster with parallelization
    for i in nb.prange(x.shape[0]):
        y[i] = (x[i] - mu) / sigma
        y[i] = 1 / (np.pi * sigma * (1 + y[i] ** 2))
    return y


# define the cdf
@nb.njit(**kwd_parallel)
def nb_cauchy_cdf(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    r"""
    Parameters
    ----------
    x
        The input data
    lamb
        The rate
    mu
        The amount to shift the distribution
    sigma
        The amount to scale the distribution
    """
    y = np.empty_like(x, dtype=np.float64)
    for i in nb.prange(x.shape[0]):
        y[i] = (x[i] - mu) / sigma
        y[i] = (1 / np.pi) * np.arctan(y[i]) + 0.5
    return y


# define the scaled pdf, can't use parallelization here because there is no outer for-loop
@nb.njit(**kwd)
def nb_cauchy_scaled_pdf(
    x: np.ndarray, area: float, mu: float, sigma: float
) -> np.ndarray:
    r"""
    Parameters
    ----------
    x
        The input data
    area
        The prefactor to scale the pdf by
    lamb
        The rate
    mu
        The amount to shift the distribution
    sigma
        The amount to scale the distribution
    """
    return area * nb_cauchy_pdf(x, mu, sigma)


# define the scaled cdf, can't use parallelization here because there is no outer for-loop
@nb.njit(**kwd)
def nb_cauchy_scaled_cdf(
    x: np.ndarray, area: float, mu: float, sigma: float
) -> np.ndarray:
    r"""
    Parameters
    ----------
    x
        The input data
    area
        The prefactor to scale the pdf by
    lamb
        The rate
    mu
        The amount to shift the distribution
    sigma
        The amount to scale the distribution
    """
    return area * nb_cauchy_cdf(x, mu, sigma)

Now we create a class using these four functions#

Each pygama distribution class subclasses pygama_continuous, which itself is a subclass of rv_continuous.

An aside on the ordering of parameters:#

In the following the term “shape parameter” refers to a specific variable used to define a distribution that is not an overall scaling (which I call \(\mu\)) or shifting (which I call \(\sigma\)). An example of shape parameters would be the \(\beta, m\) parameters in a crystal ball function. The location and scale (\(\mu, \sigma\)) parameters have a more specific definition — akin to what scipy does. If a \(\mu, \sigma\) value is present, then the entire distribution is shifted and scaled; this can be achieved by transforming \(y = \frac{x-\mu}{\sigma}\) and computing \(pdf(y)/\sigma\) ( where the factor of \(1/\sigma\) comes from the Jacobian).

The ordering of parameters is as follows, using whichever are needed for a function: x, area, x_lower, x_upper, shapes, ..., mu, sigma

Where x_lower and x_upper are the lower and upper bounds to evaluate a function on.

Now, back to require methods#

The class name has the format distribution_name_gen and it must include the following methods with their arguments:

_pdf(self, x, shapes):

This is an overloading of the scipy method for the pdf. Because scipy first computes \(\frac{x-\mu}{\sigma}\) and \(pdf/\sigma\), we need to call nb_distribution_pdf with \(\mu = 0\) and \(\sigma = 1\). If shape parameters are present, nb_distribution_pdf must also be called with shape_param[0] because scipy passes an array of shape parameters to the function calls. Finally, x.flags.writeable = True must be passed so that numba can operate as expected on arrays.
_cdf(self, x, shapes):

This is an overloading of the scipy method for the cdf. The same format applies as for the pdf.
get_pdf(x, shapes, mu, sigma):

This is a direct call to the fast nb_distribution_pdf
get_pdf(x, shapes, mu, sigma):

This is a direct call to the fast nb_distribution_cdf
norm_pdf(x, x_lower, x_upper, shapes, mu, sigma):

In the case that the support of the distribution is the entire real line, this calls a pygama_continuous super method that normalizes the pdf to unity on the range \([x_{lower},x_{upper}]\) by computing \(\frac{pdf(x)}{cdf(x_{upper})- cdf(x_{lower})}\). If the distribution is defined on a limited domain, this function needs to be directly overloaded – see pygama.math.functions.step for an example.
norm_cdf(x, x_lower, x_upper, shapes, mu, sigma):

In the case that the support of the distribution is the entire real line, this calls a pygama_continuous super method that derives the cdf from a pdf that is normalized to unity on the range \([x_{lower},x_{upper}]\) by computing \(\frac{cdf(x)}{cdf(x_{upper})- cdf(x_{lower})}\). If the distribution is defined on a limited domain, this function needs to be directly overloaded – see pygama.math.functions.step for an example.
pdf_ext(x, area, x_lower, x_upper, shapes, mu, sigma):

A function that returns both the integral of the support-normalized pdf on the interval \([x_{lower},x_{upper}]\), as well as the support-normalized pdf on that range as well. The fastest way to compute these is to return np.diff(nb_distribution_scaled_cdf(np.array([x_lower, x_upper]), area, shapes, mu, sigma)) and nb_distribution_scaled_pdf(x, area, shapes, mu, sigma)
cdf_ext(x, area, shapes, mu, sigma):

A function that returns just the scaled cdf, nb_distribution_scaled_cdf(x, area, shapes, mu, sigma)
req_args(self):

A function that returns a tuple with the strings of the names of the required shape parameters and mu and sigma

[25]:

from pygama.math.functions.pygama_continuous import pygama_continuous


class cauchy_gen(pygama_continuous):
    def _pdf(self, x: np.ndarray) -> np.ndarray:
        x.flags.writeable = True
        return nb_cauchy_pdf(x, 0, 1)

    def _cdf(self, x: np.ndarray) -> np.ndarray:
        x.flags.writeable = True
        return nb_cauchy_cdf(x, 0, 1)

    def get_pdf(self, x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
        return nb_cauchy_pdf(x, mu, sigma)

    def get_cdf(self, x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
        return nb_cauchy_cdf(x, mu, sigma)

    # needed so that we can hack iminuit's introspection to function parameter names.
    def norm_pdf(
        self, x: np.ndarray, x_lower: float, x_upper: float, mu: float, sigma: float
    ) -> np.ndarray:
        return self._norm_pdf(x, x_lower, x_upper, mu, sigma)

    def norm_cdf(
        self, x: np.ndarray, x_lower: float, x_upper: float, mu: float, sigma: float
    ) -> np.ndarray:
        return self._norm_cdf(x, x_lower, x_upper, mu, sigma)

    def pdf_ext(
        self,
        x: np.ndarray,
        area: float,
        x_lower: float,
        x_upper: float,
        mu: float,
        sigma: float,
    ) -> np.ndarray:
        return np.diff(
            nb_cauchy_scaled_cdf(np.array([x_lower, x_upper]), area, mu, sigma)
        ), nb_cauchy_scaled_pdf(x, area, mu, sigma)

    def cdf_ext(
        self, x: np.ndarray, area: float, mu: float, sigma: float
    ) -> np.ndarray:
        return nb_cauchy_scaled_cdf(x, area, mu, sigma)

    def required_args(self) -> tuple[str, str]:
        return "mu", "sigma"


cauchy = cauchy_gen(name="cauchy")

[26]:

from scipy.stats import cauchy as scipy_cauchy

cauchy_pdf = cauchy.pdf(x, mu, sigma)
cauchy_get_pdf = cauchy.get_pdf(x, mu, sigma)

cauchy_cdf = cauchy.cdf(x, mu, sigma)
cauchy_get_cdf = cauchy.get_cdf(x, mu, sigma)

plt.plot(x, cauchy_pdf, label="Pygama Cauchy PDF", c="k")
plt.plot(x, cauchy_get_pdf, label="Pygama Cauchy get_pdf", c="b", ls="--")
plt.plot(
    x, scipy_cauchy.pdf(x, mu, sigma), label="Scipy Cauchy PDF", ls=":", alpha=1, c="r"
)
plt.title("PDF of a Cauchy Distribution")
plt.legend()
plt.show()


plt.plot(x, cauchy_cdf, label="Pygama Cauchy CDF", c="k")
plt.plot(x, cauchy_get_cdf, label="Pygama Cauchy get_cdf", c="b", ls="--")
plt.plot(
    x, scipy_cauchy.cdf(x, mu, sigma), label="Scipy Cauchy CDF", ls=":", alpha=1, c="r"
)
plt.title("CDF of a Cauchy Distribution")
plt.legend()
plt.show()

../_images/notebooks_MathTutorial_41_0.png

../_images/notebooks_MathTutorial_41_1.png

Perform some fits using the new Cauchy distribution to show the other methods’ validity#

[27]:

xr = (-4, 4)  # xrange
mu = 0.5
sigma = 0.7

xdata = scipy_cauchy.rvs(mu, sigma, size=1000, random_state=42)
xdata = xdata[(xr[0] < xdata) & (xdata < xr[1])]

n, xe = np.histogram(xdata, bins=50, range=xr)
cx = 0.5 * (xe[1:] + xe[:-1])
dx = np.diff(xe)

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.plot(xdata, np.zeros_like(xdata), "|", alpha=0.1);

../_images/notebooks_MathTutorial_43_0.png

unbinned fit#

[28]:

c = cost.UnbinnedNLL(xdata, cauchy.norm_pdf)

m_norm = Minuit(c, x_lower=xr[0], x_upper=xr[1], mu=0.4, sigma=0.2)
m_norm.fixed["x_lower", "x_upper"] = True
m_norm.limits["mu", "sigma"] = (0, None)
m_norm.migrad()

[28]:

Migrad
FCN = 2784		Nfcn = 56
EDM = 1.66e-07 (Goal: 0.0002)		time = 0.7 sec
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	x_lower	-4.00	0.04					yes
1	x_upper	4.00	0.04					yes
2	mu	0.478	0.033			0
3	sigma	0.73	0.04			0

	x_lower	x_upper	mu	sigma
x_lower	0	0	0	0
x_upper	0	0	0	0
mu	0	0	0.00106	-2.25e-05 (-0.018)
sigma	0	0	-2.25e-05 (-0.018)	0.00147

extended unbinned fit#

[29]:

c = cost.ExtendedUnbinnedNLL(xdata, cauchy.pdf_ext)

m = Minuit(c, area=500, x_lower=xr[0], x_upper=xr[1], mu=0.1, sigma=0.9)
m.fixed["x_lower", "x_upper"] = True
m.limits["area", "mu", "sigma"] = (0.01, None)
m.migrad()

[29]:

Migrad
FCN = -7457		Nfcn = 102
EDM = 2.22e-07 (Goal: 0.0002)		time = 0.3 sec
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	area	1.002e3	0.034e3			0.01
1	x_lower	-4.00	0.04					yes
2	x_upper	4.00	0.04					yes
3	mu	0.478	0.033			0.01
4	sigma	0.73	0.04			0.01

	area	x_lower	x_upper	mu	sigma
area	1.18e+03	0	0	0.00417	0.26 (0.197)
x_lower	0	0	0	0	0
x_upper	0	0	0	0	0
mu	0.00417	0	0	0.00106	-2.19e-05 (-0.018)
sigma	0.26 (0.197)	0	0	-2.19e-05 (-0.018)	0.00147

Binned fit#

[30]:

c = cost.BinnedNLL(n, xe, cauchy.norm_cdf)

m_norm = Minuit(c, x_lower=xr[0], x_upper=xr[1], mu=0.4, sigma=0.2)
m_norm.fixed["x_lower", "x_upper"] = True
m_norm.limits["mu", "sigma"] = (0.1, None)
m_norm.migrad()

[30]:

Migrad
FCN = 52.27 (chi2/ndof = 1.1)		Nfcn = 48
EDM = 3.04e-05 (Goal: 0.0002)
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	x_lower	-4.00	0.04					yes
1	x_upper	4.00	0.04					yes
2	mu	0.482	0.033			0.1
3	sigma	0.73	0.04			0.1

	x_lower	x_upper	mu	sigma
x_lower	0	0	0	0
x_upper	0	0	0	0
mu	0	0	0.00107	-2.99e-05 (-0.024)
sigma	0	0	-2.99e-05 (-0.024)	0.00149

Extended binned fit#

[31]:

c = cost.ExtendedBinnedNLL(n, xe, cauchy.cdf_ext)

m = Minuit(c, area=500, mu=0.1, sigma=0.9)
m.limits["area", "mu", "sigma"] = (0, None)
m.migrad()

[31]:

Migrad
FCN = 52.27 (chi2/ndof = 1.1)		Nfcn = 95
EDM = 6.18e-06 (Goal: 0.0002)
Valid Minimum		No Parameters at limit
Below EDM threshold (goal x 10)		Below call limit
Covariance	Hesse ok	Accurate	Pos. def.	Not forced

	Name	Value	Hesse Error	Minos Error-	Minos Error+	Limit-	Limit+	Fixed
0	area	1.002e3	0.034e3			0
1	mu	0.483	0.033			0
2	sigma	0.73	0.04			0

	area	mu	sigma
area	1.18e+03	0.00298	0.264 (0.199)
mu	0.00298	0.00107	-2.98e-05 (-0.024)
sigma	0.264 (0.199)	-2.98e-05 (-0.024)	0.00149

Adding distributions with `sum_dists`#

In the business of fitting data, adding two distributions together is our bread-and-butter. This part of the notebook will show you how to create your own distribution that subclasses pygama.math.sum_dists. There are a couple of different use cases for adding distributions together, and we will look at each in turn, but here is a summary:

Adding different fractions of distributions together, and fitting out the relative amount of distributions
Adding distributions with different areas (present in equal fractions) and fitting the areas
Adding distributions with different areas that have different fractions, and fitting out the areas and fractions

Let’s first create the synthetic data that we will fit#

[32]:

xr = (-4, 4)  # xrange

rng = np.random.default_rng(1)
mu = 0.5
sigma = 1.3

mu2 = 1
sigma2 = 0.6


xdata = norm.rvs(mu, sigma, size=1000, random_state=42)
ydata = scipy_cauchy.rvs(mu2, sigma2, size=len(xdata), random_state=42)
xmix = np.append(xdata, ydata)
xmix = xmix[(xr[0] < xmix) & (xmix < xr[1])]

n, xe = np.histogram(xmix, bins=50, range=xr)
cx = 0.5 * (xe[1:] + xe[:-1])
dx = np.diff(xe)

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.plot(xmix, np.zeros_like(xmix), "|", alpha=0.1);

../_images/notebooks_MathTutorial_53_0.png

A little about `sum_dists`#

The first important thing to note about this class is that all method class are of the form method(x, parameter_array). We want to use a single array as an argument because we need to accommodate an arbitrarily large number of parameters. In addition, a single array is an acceptable input for Iminuit fits.

So, the first thing a user does in creating a new class is define the order of elements in a parameter_array. Now, sum_dists works by grabbing elements at different indices in this parameter_array according to rules that the user provides in the __init__. Here’s how to write a generic class:

Create a parameter_index_array that holds the indices of what will eventually come in the parameter_array. If the user will eventually pass parameters = [frac1, mu, sigma] then we just take parameter_index_array=[frac1, mu, sigma]=range(3)
sum_dists takes an alternating pattern of distributions and distribution-specific parameter_index_arrays. Each par array can contain [shape, mu, sigma, area, frac] with area and frac being optional (depending on the flag sent to the constructor) but must be placed in that order.
We pass one of the 4 flag options, to be described below.

Let’s trace through how the code works for a simple example. Suppose we want to create the following distribution

\(new\_pdf(x, [\tau, \mu, \sigma, frac_1, frac_2]) = frac_1\cdot dist_1(x, \tau, \mu, \sigma) + frac_2\cdot dist_2(x, \mu, \sigma)\)

The first thing we would do is create our parameter_index_array

[tau, mu, sigma, frac_1, frac_2] = range(5)

Then, we would create our alternating pattern of distributions and parameter index arrays:

args = [dist1, [tau, mu, sigma, frac_1], dist2, [mu, sigma, frac2]]

Finally, we would intitalize (with the fracs flag in this case, more on that later)

super().__init__(*args, frac_flag = "fracs")

So… What is `sum_dists` actually doing?#

Under the hood, sum_dists is applying a set of rules so that the following is always computed, regardless of the flag sent to the constructor.

total_area*area1*frac1*dist1(x, shape, mu, sigma) + ... total_area*area_n*frac_n*dist_n(x, shape_1, mu_n, sigma_n)

It does this by first reading through all of the parameter index arrays for each distribution and separating the indices for the shape parameters, areas, and fracs into separate index arrays. Then, at the time of method call, these separate parameter index arrays are used to grab the actual values for the shape parameters, fractions, and areas. There’s also some work done to determine which of area and frac are present. That’s the purpose of the frac_flag. Let’s take some time and learn a little more about what each flag does.

The `areas` flag in the `sum_dists` constructor#

Let’s say we are interested in knowing the amount of counts present in a signal and a background in our total spectrum, i.e. we want to create and fit a function that looks like

\(pdf= A\cdot gauss\_pdf + B\cdot cauchy\_pdf\)

Because we are interested in fitting the areas, we send the areas keyword to frac_flag. This causes sum_dists to assume that the last element in each parameter index array in the constructor is an area for that specific distribution. By default, all the fracs are set to 1, as well as the total_area.

[33]:

from pygama.math.functions.sum_dists import sum_dists


class cauchy_on_gauss_gen(sum_dists):
    def __init__(self):
        # we first create an array contains the indices of the parameter array
        # that will eventually be input to function calls
        (area1, mu, sigma, area2, mu2, sigma2) = range(6)

        # we now create an array containing the distributions and their shape parameters
        # The last item in a parameter array is the fraction or area component
        # The last item in the last parameter array corresponds to the total area, if present
        # but because we are fitting individual areas, the total area is meaningless
        args = [gaussian, [mu, sigma, area1], cauchy, [mu2, sigma2, area2]]
        # we initialize with the frac_flag = "areas" to let the constructor know we are sending frac parameters only
        super().__init__(*args, frac_flag="areas")

    def get_req_args(self) -> tuple[str, str, str]:
        r"""
        Return the required arguments for this instance
        """
        return "area1", "mu", "sigma", "area2", "mu2", "sigma2"


cauchy_on_gauss = cauchy_on_gauss_gen()

That was super quick, let’s see how it does with fitting data#

Because we are interested in fitting out areas, we need to use the extended forms of our fits And for pdf_ext methods we have two options: we either pass x_lo, x_hi at the start of our parameter array, or we don’t pass them and the function automatically takes the np.amin(x), np.amax(x) as those values

something to note: all parameter arrays passed to `sum_dist` methods must be `numpy` arrays; for example, see that we must turn `m.values` into `np.array(m.values)` in order for it to work#

[34]:

c = cost.ExtendedUnbinnedNLL(xmix, cauchy_on_gauss.pdf_ext)

m = Minuit(c, (xr[0], xr[1], 100, 0.1, 0.3, 100, 1, 2))
m.fixed[0, 1] = True  # fix the x_lo, x_hi
m.limits[2, 5] = (0, None)
m.limits[3, 4, 6, 7] = (0, 2)
print(m.migrad())

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, cauchy_on_gauss.pdf_ext(xm, np.array(m.values))[1] * dx[0], label="fit")
plt.plot(
    xm,
    cauchy_on_gauss.pdf_ext(xm, np.array([1000, mu, sigma, 1000, mu2, sigma2]))[1]
    * dx[0],
    label="actual",
)
plt.legend();

┌─────────────────────────────────────────────────────────────────────────┐
│                                Migrad                                   │
├──────────────────────────────────┬──────────────────────────────────────┤
│ FCN = -1.875e+04                 │              Nfcn = 408              │
│ EDM = 9.35e-05 (Goal: 0.0002)    │            time = 0.9 sec            │
├──────────────────────────────────┼──────────────────────────────────────┤
│          Valid Minimum           │        No Parameters at limit        │
├──────────────────────────────────┼──────────────────────────────────────┤
│ Below EDM threshold (goal x 10)  │           Below call limit           │
├───────────────┬──────────────────┼───────────┬─────────────┬────────────┤
│  Covariance   │     Hesse ok     │ Accurate  │  Pos. def.  │ Not forced │
└───────────────┴──────────────────┴───────────┴─────────────┴────────────┘
┌───┬──────┬───────────┬───────────┬────────────┬────────────┬─────────┬─────────┬───────┐
│   │ Name │   Value   │ Hesse Err │ Minos Err- │ Minos Err+ │ Limit-  │ Limit+  │ Fixed │
├───┼──────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┼───────┤
│ 0 │ x0   │   -4.00   │   0.04    │            │            │         │         │  yes  │
│ 1 │ x1   │   4.00    │   0.04    │            │            │         │         │  yes  │
│ 2 │ x2   │  1.19e3   │  0.15e3   │            │            │    0    │         │       │
│ 3 │ x3   │   0.57    │   0.07    │            │            │    0    │    2    │       │
│ 4 │ x4   │   1.32    │   0.05    │            │            │    0    │    2    │       │
│ 5 │ x5   │  0.78e3   │  0.17e3   │            │            │    0    │         │       │
│ 6 │ x6   │   0.98    │   0.04    │            │            │    0    │    2    │       │
│ 7 │ x7   │   0.50    │   0.09    │            │            │    0    │    2    │       │
└───┴──────┴───────────┴───────────┴────────────┴────────────┴─────────┴─────────┴───────┘
┌────┬─────────────────────────────────────────────────────────────────────────────────┐
│    │        x0        x1        x2        x3        x4        x5        x6        x7 │
├────┼─────────────────────────────────────────────────────────────────────────────────┤
│ x0 │         0         0         0         0         0         0         0         0 │
│ x1 │         0         0         0         0         0         0         0         0 │
│ x2 │         0         0  2.17e+04      6.02      2.27 -2.38e+04     0.492     -10.9 │
│ x3 │         0         0      6.02   0.00465  0.000935     -6.86 -0.000945  -0.00302 │
│ x4 │         0         0      2.27  0.000935   0.00242      -2.7 -0.000432   -0.0023 │
│ x5 │         0         0 -2.38e+04     -6.86      -2.7  2.84e+04    -0.587      12.8 │
│ x6 │         0         0     0.492 -0.000945 -0.000432    -0.587   0.00194 -0.000165 │
│ x7 │         0         0     -10.9  -0.00302   -0.0023      12.8 -0.000165   0.00805 │
└────┴─────────────────────────────────────────────────────────────────────────────────┘

../_images/notebooks_MathTutorial_57_1.png

Well, that’s pretty good! Let’s see how the extended unbinned fit fairs

[35]:

c = cost.ExtendedBinnedNLL(n, xe, cauchy_on_gauss.cdf_ext)
m = Minuit(c, (100, 0.1, 0.3, 100, 1, 2))
m.limits[0, 3] = (0, None)
m.limits[1, 2, 4, 5] = (0, 2)
print(m.migrad())

plt.errorbar(cx, n, n**0.5, fmt="ok")
plt.stairs(np.diff(cauchy_on_gauss.cdf_ext(xe, np.array(m.values))), xe, label="fit")
plt.stairs(
    np.diff(
        cauchy_on_gauss.cdf_ext(xe, np.array([1000, mu, sigma, 1000, mu2, sigma2]))
    ),
    xe,
    label="actual",
)
plt.legend();

┌─────────────────────────────────────────────────────────────────────────┐
│                                Migrad                                   │
├──────────────────────────────────┬──────────────────────────────────────┤
│ FCN = 58.66 (chi2/ndof = 1.3)    │              Nfcn = 461              │
│ EDM = 9.22e-05 (Goal: 0.0002)    │                                      │
├──────────────────────────────────┼──────────────────────────────────────┤
│          Valid Minimum           │        No Parameters at limit        │
├──────────────────────────────────┼──────────────────────────────────────┤
│ Below EDM threshold (goal x 10)  │           Below call limit           │
├───────────────┬──────────────────┼───────────┬─────────────┬────────────┤
│  Covariance   │     Hesse ok     │ Accurate  │  Pos. def.  │ Not forced │
└───────────────┴──────────────────┴───────────┴─────────────┴────────────┘
┌───┬──────┬───────────┬───────────┬────────────┬────────────┬─────────┬─────────┬───────┐
│   │ Name │   Value   │ Hesse Err │ Minos Err- │ Minos Err+ │ Limit-  │ Limit+  │ Fixed │
├───┼──────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┼───────┤
│ 0 │ x0   │  1.20e3   │  0.14e3   │            │            │    0    │         │       │
│ 1 │ x1   │   0.57    │   0.06    │            │            │    0    │    2    │       │
│ 2 │ x2   │   1.32    │   0.05    │            │            │    0    │    2    │       │
│ 3 │ x3   │  0.77e3   │  0.15e3   │            │            │    0    │         │       │
│ 4 │ x4   │   0.98    │   0.05    │            │            │    0    │    2    │       │
│ 5 │ x5   │   0.49    │   0.08    │            │            │    0    │    2    │       │
└───┴──────┴───────────┴───────────┴────────────┴────────────┴─────────┴─────────┴───────┘
┌────┬─────────────────────────────────────────────────────────────┐
│    │        x0        x1        x2        x3        x4        x5 │
├────┼─────────────────────────────────────────────────────────────┤
│ x0 │  1.82e+04      4.66      1.48 -1.96e+04     0.631     -9.06 │
│ x1 │      4.66   0.00388  0.000735     -5.31 -0.000857  -0.00235 │
│ x2 │      1.48  0.000735   0.00208     -1.76 -0.000397  -0.00177 │
│ x3 │ -1.96e+04     -5.31     -1.76  2.35e+04    -0.736      10.7 │
│ x4 │     0.631 -0.000857 -0.000397    -0.736   0.00203 -0.000313 │
│ x5 │     -9.06  -0.00235  -0.00177      10.7 -0.000313   0.00708 │
└────┴─────────────────────────────────────────────────────────────┘

../_images/notebooks_MathTutorial_59_1.png

The `fracs` flag in the `sum_dists` constructor#

To get a feel for how sum_dists works, let’s create a function that creates the following distribution:

\(pdf = f_1\cdot gauss\_{pdf}+(1-f_1)\cdot cauchy\_{pdf}\)

Well, it might seem crazy, but we can actually link together parameters before they are fit! That’s the beauty of the parameter index array and the link_pars function that we can overload.

The fracs keyword causes sum_dists to assume the following about the parameter index arrays in the __init__: each parameter index array looks like [shapes, mu, sigma, frac] with the exception of the last parameter index array which has the form [shapes, mu, sigma, frac, total_area].

Well, what if we don’t care about total_area? Or, what if we want to link parameters, like the f_1 value in the above example? That’s where link_pars comes in. Nominally, link_pars serves to extract the actual values of the shapes, fracs, and areas from the passed parameter array. But, if we overload it, we can actually link the parameters together! See the below code for an example.

[36]:

from pygama.math.functions.sum_dists import sum_dists


class cauchy_on_gauss_gen(sum_dists):
    def __init__(self):
        # we first create an array contains the indices of the parameter array
        # that will eventually be input to function calls
        (frac1, mu, sigma, mu2, sigma2) = range(5)

        # we now create an array containing the distributions and their shape parameters
        # The last item in a parameter array is the fraction or area component
        # The last item in the last parameter array corresponds to the total area if present
        args = [gaussian, [mu, sigma, frac1], cauchy, [mu2, sigma2, frac1, frac1]]
        # we initialize with the frac_flag = "fracs" to let the constructor know we are sending frac parameters only
        super().__init__(*args, frac_flag="fracs")

    def _link_pars(
        self,
        shape_par_idx,
        area_idx,
        frac_idx,
        total_area_idx,
        params,
        areas,
        fracs,
        total_area,
    ):
        shape_pars, cum_len, areas, fracs, total_area = super()._link_pars(
            shape_par_idx,
            area_idx,
            frac_idx,
            total_area_idx,
            params,
            areas,
            fracs,
            total_area,
        )

        fracs[1] = (
            1 - fracs[0]
        )  # create :math:`(1-frac1)` for the cauchy, and :math:`frac1` for the exgauss
        total_area[
            0
        ] = 1  # just overwrite the total area because we don't want to fit it

        return shape_pars, cum_len, areas, fracs, total_area

    def get_req_args(self) -> tuple[str, str, str]:
        r"""
        Return the required arguments for this instance
        """
        return "frac1", "mu", "sigma", "mu2", "sigma2"


cauchy_on_gauss = cauchy_on_gauss_gen()

Again, that was painless; let’s see how well it does fitting data#

[37]:

c = cost.UnbinnedNLL(xmix, cauchy_on_gauss.norm_pdf)

m = Minuit(c, (xr[0], xr[1], 0.5, 0.1, 0.2, 1, 0.1))
m.fixed[0, 1] = True
m.limits[2] = (0, 1)
m.limits[3, 4, 5, 6] = (0.5, 2)
print(m.migrad())

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(
    xm,
    cauchy_on_gauss.norm_pdf(xm, np.array(m.values)) * len(xmix) * dx[0],
    label="fit",
)
plt.plot(
    xm,
    cauchy_on_gauss.get_pdf(xm, np.array([0.5, mu, sigma, mu2, sigma2]))
    * len(xmix)
    * dx[0],
    label="actual",
)
plt.legend();

┌─────────────────────────────────────────────────────────────────────────┐
│                                Migrad                                   │
├──────────────────────────────────┬──────────────────────────────────────┤
│ FCN = 6048                       │              Nfcn = 353              │
│ EDM = 4.39e-05 (Goal: 0.0002)    │                                      │
├──────────────────────────────────┼──────────────────────────────────────┤
│          Valid Minimum           │       SOME Parameters at limit       │
├──────────────────────────────────┼──────────────────────────────────────┤
│ Below EDM threshold (goal x 10)  │           Below call limit           │
├───────────────┬──────────────────┼───────────┬─────────────┬────────────┤
│  Covariance   │     Hesse ok     │ Accurate  │  Pos. def.  │ Not forced │
└───────────────┴──────────────────┴───────────┴─────────────┴────────────┘
┌───┬──────┬───────────┬───────────┬────────────┬────────────┬─────────┬─────────┬───────┐
│   │ Name │   Value   │ Hesse Err │ Minos Err- │ Minos Err+ │ Limit-  │ Limit+  │ Fixed │
├───┼──────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┼───────┤
│ 0 │ x0   │   -4.00   │   0.04    │            │            │         │         │  yes  │
│ 1 │ x1   │   4.00    │   0.04    │            │            │         │         │  yes  │
│ 2 │ x2   │   0.60    │   0.05    │            │            │    0    │    1    │       │
│ 3 │ x3   │   0.57    │   0.06    │            │            │   0.5   │    2    │       │
│ 4 │ x4   │   1.32    │   0.04    │            │            │   0.5   │    2    │       │
│ 5 │ x5   │   0.98    │   0.04    │            │            │   0.5   │    2    │       │
│ 6 │ x6   │    0.5    │    1.1    │            │            │   0.5   │    2    │       │
└───┴──────┴───────────┴───────────┴────────────┴────────────┴─────────┴─────────┴───────┘
┌────┬───────────────────────────────────────────────────────────────────────┐
│    │        x0        x1        x2        x3        x4        x5        x6 │
├────┼───────────────────────────────────────────────────────────────────────┤
│ x0 │         0         0         0         0         0         0         0 │
│ x1 │         0         0         0         0         0         0         0 │
│ x2 │         0         0   0.00226   0.00117 -0.000229  0.000145 -0.000197 │
│ x3 │         0         0   0.00117   0.00334  0.000264 -0.000948 -9.47e-05 │
│ x4 │         0         0 -0.000229  0.000264   0.00176 -0.000472 -7.45e-05 │
│ x5 │         0         0  0.000145 -0.000948 -0.000472   0.00195 -8.09e-06 │
│ x6 │         0         0 -0.000197 -9.47e-05 -7.45e-05 -8.09e-06  7.21e-05 │
└────┴───────────────────────────────────────────────────────────────────────┘

../_images/notebooks_MathTutorial_63_1.png

The `both` flag in the `sum_dists` constructor#

What if we wanted to find the signal and background counts for distributions that share unequal fractions? I.e. what if we want to construct the following:

\(pdf = Af_1\cdot gauss\_{pdf}+B(1-f_1)\cdot cauchy\_{pdf}\)

It’s easy! We use the both flag!

The both keywords causes sum_dists to look for ordering of parameters in each index array like [shape, mu, sigma, area, frac], except the last parameter array index must have the form [shape, mu, sigma, area, frac, total_area]

[38]:

from pygama.math.functions.sum_dists import sum_dists


class cauchy_on_gauss_gen(sum_dists):
    def __init__(self):
        # we first create an array contains the indices of the parameter array
        # that will eventually be input to function calls
        (area1, frac1, mu, sigma, area2, mu2, sigma2) = range(7)

        # we now create an array containing the distributions and their shape parameters
        # The last item in a parameter array is the fraction or area component
        # The last item in the last parameter array corresponds to the total area if present
        args = [
            gaussian,
            [mu, sigma, area1, frac1],
            cauchy,
            [mu2, sigma2, area2, frac1, frac1],
        ]
        # we initialize with the frac_flag = "fracs" to let the constructor know we are sending frac parameters only
        super().__init__(*args, frac_flag="both")

    def _link_pars(
        self,
        shape_par_idx,
        area_idx,
        frac_idx,
        total_area_idx,
        params,
        areas,
        fracs,
        total_area,
    ):
        shape_pars, cum_len, areas, fracs, total_area = super()._link_pars(
            shape_par_idx,
            area_idx,
            frac_idx,
            total_area_idx,
            params,
            areas,
            fracs,
            total_area,
        )

        fracs[1] = (
            1 - fracs[0]
        )  # create :math:`(1-frac1)` for the gaussian, and :math:`htail` for the exgauss
        total_area[
            0
        ] = 1  # just overwrite the total area because we don't want to fit it

        return shape_pars, cum_len, areas, fracs, total_area

    def get_req_args(self) -> tuple[str, str, str]:
        r"""
        Return the required arguments for this instance
        """
        return "area1", "frac1", "mu", "sigma", "area2", "mu2", "sigma2"


cauchy_on_gauss = cauchy_on_gauss_gen()

[39]:

c = cost.ExtendedUnbinnedNLL(xmix, cauchy_on_gauss.pdf_ext)

m = Minuit(c, (xr[0], xr[1], 100, 0.1, 0.1, 0.3, 100, 1, 2))
m.fixed[0, 1] = True  # fix the x_lo, x_hi
m.limits[2, 6] = (0, None)
m.limits[3] = (0, 1)
m.limits[4, 5, 7, 8] = (0, 2)
print(m.migrad())

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, cauchy_on_gauss.pdf_ext(xm, np.array(m.values))[1] * dx[0], label="fit")
plt.legend();

┌─────────────────────────────────────────────────────────────────────────┐
│                                Migrad                                   │
├──────────────────────────────────┬──────────────────────────────────────┤
│ FCN = -1.875e+04                 │              Nfcn = 458              │
│ EDM = 1.58e-05 (Goal: 0.0002)    │            time = 0.1 sec            │
├──────────────────────────────────┼──────────────────────────────────────┤
│          Valid Minimum           │        No Parameters at limit        │
├──────────────────────────────────┼──────────────────────────────────────┤
│ Below EDM threshold (goal x 10)  │           Below call limit           │
├───────────────┬──────────────────┼───────────┬─────────────┬────────────┤
│  Covariance   │     Hesse ok     │APPROXIMATE│NOT pos. def.│   FORCED   │
└───────────────┴──────────────────┴───────────┴─────────────┴────────────┘
┌───┬──────┬───────────┬───────────┬────────────┬────────────┬─────────┬─────────┬───────┐
│   │ Name │   Value   │ Hesse Err │ Minos Err- │ Minos Err+ │ Limit-  │ Limit+  │ Fixed │
├───┼──────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┼───────┤
│ 0 │ x0   │   -4.00   │   0.04    │            │            │         │         │  yes  │
│ 1 │ x1   │   4.00    │   0.04    │            │            │         │         │  yes  │
│ 2 │ x2   │   4.4e3   │   2.3e3   │            │            │    0    │         │       │
│ 3 │ x3   │   0.27    │   0.14    │            │            │    0    │    1    │       │
│ 4 │ x4   │   0.57    │   0.06    │            │            │    0    │    2    │       │
│ 5 │ x5   │   1.32    │   0.05    │            │            │    0    │    2    │       │
│ 6 │ x6   │  1.07e3   │  0.27e3   │            │            │    0    │         │       │
│ 7 │ x7   │   0.98    │   0.04    │            │            │    0    │    2    │       │
│ 8 │ x8   │   0.50    │   0.08    │            │            │    0    │    2    │       │
└───┴──────┴───────────┴───────────┴────────────┴────────────┴─────────┴─────────┴───────┘
┌────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│    │        x0        x1        x2        x3        x4        x5        x6        x7        x8 │
├────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ x0 │         0         0         0         0         0         0         0         0         0 │
│ x1 │         0         0         0         0         0         0         0         0         0 │
│ x2 │         0         0  5.36e+06      -324      5.97      2.36 -5.09e+05         1     -12.5 │
│ x3 │         0         0      -324    0.0204  0.000599  0.000153      26.7  3.53e-05 -0.000986 │
│ x4 │         0         0      5.97  0.000599   0.00384  0.000706     -5.74 -0.000882  -0.00206 │
│ x5 │         0         0      2.36  0.000153  0.000706    0.0021     -1.93 -0.000405  -0.00165 │
│ x6 │         0         0 -5.09e+05      26.7     -5.74     -1.93   7.3e+04    -0.631      11.1 │
│ x7 │         0         0         1  3.53e-05 -0.000882 -0.000405    -0.631   0.00193 -0.000193 │
│ x8 │         0         0     -12.5 -0.000986  -0.00206  -0.00165      11.1 -0.000193   0.00617 │
└────┴───────────────────────────────────────────────────────────────────────────────────────────┘

../_images/notebooks_MathTutorial_66_1.png

Let’s see if that last fit agrees with a fit from a function we would define “by hand”#

[40]:

def cauchy_gauss_pdf_ext(x, x_lo, x_hi, A, B, frac, mu, sigma, mu2, sigma2):
    return A * frac * np.diff(
        gaussian.get_cdf(np.array([x_lo, x_hi]), mu, sigma)
    ) + B * (1 - frac) * np.diff(
        cauchy.get_cdf(np.array([x_lo, x_hi]), mu2, sigma2)
    ), A * frac * gaussian.get_pdf(
        x, mu, sigma
    ) + (
        1 - frac
    ) * B * cauchy.get_pdf(
        x, mu2, sigma2
    )

[41]:

c = cost.ExtendedUnbinnedNLL(xmix, cauchy_gauss_pdf_ext)

m = Minuit(c, xr[0], xr[1], 100, 100, 0.1, 0.1, 0.3, 1, 2)
m.fixed[0, 1] = True  # fix the x_lo, x_hi
m.limits[2, 3] = (0, None)
m.limits[4] = (0, 1)
m.limits[5, 6, 7, 8] = (0, 2)
print(m.migrad())

plt.errorbar(cx, n, n**0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, cauchy_gauss_pdf_ext(xm, *m.values)[1] * dx[0], label="fit")
plt.legend();

┌─────────────────────────────────────────────────────────────────────────┐
│                                Migrad                                   │
├──────────────────────────────────┬──────────────────────────────────────┤
│ FCN = -1.875e+04                 │              Nfcn = 458              │
│ EDM = 1.58e-05 (Goal: 0.0002)    │                                      │
├──────────────────────────────────┼──────────────────────────────────────┤
│          Valid Minimum           │        No Parameters at limit        │
├──────────────────────────────────┼──────────────────────────────────────┤
│ Below EDM threshold (goal x 10)  │           Below call limit           │
├───────────────┬──────────────────┼───────────┬─────────────┬────────────┤
│  Covariance   │     Hesse ok     │APPROXIMATE│NOT pos. def.│   FORCED   │
└───────────────┴──────────────────┴───────────┴─────────────┴────────────┘
┌───┬────────┬───────────┬───────────┬────────────┬────────────┬─────────┬─────────┬───────┐
│   │ Name   │   Value   │ Hesse Err │ Minos Err- │ Minos Err+ │ Limit-  │ Limit+  │ Fixed │
├───┼────────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┼───────┤
│ 0 │ x_lo   │   -4.00   │   0.04    │            │            │         │         │  yes  │
│ 1 │ x_hi   │   4.00    │   0.04    │            │            │         │         │  yes  │
│ 2 │ A      │   4.4e3   │   2.3e3   │            │            │    0    │         │       │
│ 3 │ B      │  1.07e3   │  0.27e3   │            │            │    0    │         │       │
│ 4 │ frac   │   0.27    │   0.14    │            │            │    0    │    1    │       │
│ 5 │ mu     │   0.57    │   0.06    │            │            │    0    │    2    │       │
│ 6 │ sigma  │   1.32    │   0.05    │            │            │    0    │    2    │       │
│ 7 │ mu2    │   0.98    │   0.04    │            │            │    0    │    2    │       │
│ 8 │ sigma2 │   0.50    │   0.08    │            │            │    0    │    2    │       │
└───┴────────┴───────────┴───────────┴────────────┴────────────┴─────────┴─────────┴───────┘
┌────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│        │      x_lo      x_hi         A         B      frac        mu     sigma       mu2    sigma2 │
├────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│   x_lo │         0         0         0         0         0         0         0         0         0 │
│   x_hi │         0         0         0         0         0         0         0         0         0 │
│      A │         0         0  5.36e+06 -5.08e+05      -324      5.96      2.36         1     -12.5 │
│      B │         0         0 -5.08e+05   7.3e+04      26.7     -5.74     -1.93    -0.631      11.1 │
│   frac │         0         0      -324      26.7    0.0204    0.0006  0.000153  3.54e-05 -0.000986 │
│     mu │         0         0      5.96     -5.74    0.0006   0.00384  0.000706 -0.000882  -0.00206 │
│  sigma │         0         0      2.36     -1.93  0.000153  0.000706    0.0021 -0.000405  -0.00165 │
│    mu2 │         0         0         1    -0.631  3.54e-05 -0.000882 -0.000405   0.00193 -0.000193 │
│ sigma2 │         0         0     -12.5      11.1 -0.000986  -0.00206  -0.00165 -0.000193   0.00617 │
└────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

../_images/notebooks_MathTutorial_69_1.png

We see that sum_dists reproduces results quickly and without the need for spending time writing new methods for each of the summed distributions.

Conclusion#

Hopefully you have learned how to use the distributions and tools packaged in pygama for your own scientific purposes. If there’s a distribution you see is missing, feel free to contribute it using the conventions described above!

This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the pygama repository.

Introduction to Pygama Math and Fitting#

Set up the Python environment#

Statistical Distributions#

What rv_continuous subclassing gets us: .rvs, .logpdf, .sf…#

We get access to scipy’s rv_continuous methods for free#

Fast methods: .get_pdf and .get_cdf#

The remaining methods are helpful for binned and unbinned fitting: norm_pdf, norm_cdf, pdf_ext and cdf_ext#

Try an unbinned fit using the correct norm_pdf and the incorrect get_pdf#