Handling LH5 data

LEGEND stores its data in HDF5 format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented here.

Reading data from disk

Let’s start by downloading a small test LH5 file with the legend-testdata package (this may take a while, depending on your internet connection):

[1]:
from legend_testdata import LegendTestData

ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")

We can use pygama.lgdo.lh5_store.ls() [docs] to inspect the file contents:

[2]:
from pygama.lgdo import ls

ls(lh5_file)
[2]:
['geds']

This particular file contains a single HDF5 group (groups behave like directories). The second argument of ls() can be used to inspect a group. Note the trailing /: without it, only the group name itself is returned, if it exists:

[3]:
ls(lh5_file, "geds/")  # returns ['geds/raw'], which is a group again
ls(lh5_file, "geds/raw/")
[3]:
['geds/raw/baseline',
 'geds/raw/channel',
 'geds/raw/energy',
 'geds/raw/ievt',
 'geds/raw/numtraces',
 'geds/raw/packet_id',
 'geds/raw/timestamp',
 'geds/raw/tracelist',
 'geds/raw/waveform',
 'geds/raw/wf_max',
 'geds/raw/wf_std']

Note: As an alternative to ls(), show() [docs] prints a tree representation of the LH5 file contents (with LGDO types) on screen:

[4]:
from pygama.lgdo import show

show(lh5_file)
/
└── geds · HDF5 group
    └── raw · table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}
        ├── baseline · array<1>{real}
        ├── channel · array<1>{real}
        ├── energy · array<1>{real}
        ├── ievt · array<1>{real}
        ├── numtraces · array<1>{real}
        ├── packet_id · array<1>{real}
        ├── timestamp · array<1>{real}
        ├── tracelist · array<1>{array<1>{real}}
        │   ├── cumulative_length · array<1>{real}
        │   └── flattened_data · array<1>{real}
        ├── waveform · table{t0,dt,values}
        │   ├── dt · array<1>{real}
        │   ├── t0 · array<1>{real}
        │   └── values · array_of_equalsized_arrays<1,1>{real}
        ├── wf_max · array<1>{real}
        └── wf_std · array<1>{real}
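
In the tree above, tracelist is an array of variable-length arrays stored as two flat datasets, flattened_data and cumulative_length. The following minimal sketch shows how such a pair can be turned back into per-row lists, assuming that cumulative_length[i] is the (exclusive) end index of row i in flattened_data. The helper name unflatten and the input values are hypothetical; this is plain Python, not the pygama implementation.

```python
def unflatten(flattened_data, cumulative_length):
    """Split a flat list into variable-length rows.

    Assumption: cumulative_length[i] is the exclusive end index of row i
    in flattened_data, and row 0 starts at index 0.
    """
    rows = []
    start = 0
    for end in cumulative_length:
        rows.append(flattened_data[start:end])
        start = end
    return rows


# Hypothetical values, not taken from the test file
flattened_data = [53, 60, 61, 30, 53, 54]
cumulative_length = [1, 3, 3, 6]

rows = unflatten(flattened_data, cumulative_length)
print(rows)  # [[53], [60, 61], [], [30, 53, 54]]
```

A repeated value in cumulative_length (here the second 3) corresponds to an empty row.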

The group contains several LGDOs. Let’s read them into memory. We start by initializing an LH5Store [docs] object:

[5]:
from pygama.lgdo import LH5Store

store = LH5Store()

read_object() [docs] reads an LGDO from disk and returns a tuple holding the in-memory object and the number of rows read (for objects that have rows). Let’s try to read geds/raw:

[6]:
store.read_object("geds/raw", lh5_file)
[6]:
(Table(dict={'packet_id': Array([ 1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([ 0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=Array([ 1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [ 9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{t0,dt,values}'})}, attrs={'datatype': 'table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}'}),
 100)

As shown by the type signature, it is interpreted as a Table with 100 rows. Its contents (or “columns”) can therefore be viewed as LGDO objects of the same length. For example, timestamp:

[7]:
obj, n_rows = store.read_object("geds/raw/timestamp", lh5_file)
obj
[7]:
Array([0.79465985 0.7968994  0.79960424 ... 0.97331905 0.974689
       0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'})

The result is an LGDO Array with 100 elements.
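
The datatype strings that appear in the outputs above, such as table{t0,dt,values}, list the column names of a table. As a rough illustration of how that information can be recovered from the string, here is a small sketch that handles only the flat table form seen on this page. The helper table_fields is hypothetical and is not how pygama parses datatypes.

```python
import re


def table_fields(datatype):
    """Return the column names listed in a 'table{...}' datatype string."""
    match = re.fullmatch(r"table\{(.*)\}", datatype)
    if match is None:
        raise ValueError(f"not a table datatype: {datatype!r}")
    return match.group(1).split(",")


fields = table_fields("table{t0,dt,values}")
print(fields)  # ['t0', 'dt', 'values']
```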

read_object() also supports more selective reads. For example, let’s read only 10 rows starting at row 15 (that is, rows 15 to 24):

[8]:
obj, n_rows = store.read_object("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)
[0.82679445 0.8307392  0.8298773  0.830739   0.8339691  0.83487684
 0.83510256 0.83612865 0.83797085 0.8406608 ] with attrs={'units': 's'}

Or, let’s read only columns timestamp and energy from the geds/raw table and rows [1, 3, 7, 9, 10, 15]:

[9]:
obj, n_rows = store.read_object(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)
 timestamp  energy
  0.796899    8642
  0.799604   13015
  0.812317   22085
  0.813282   26636
  0.813520    2648
  0.826794    7799

with attrs['timestamp']={'units': 's'}
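
To make the selection semantics concrete, here is a plain-Python sketch of the same idea on a toy table stored as a dict of equal-length lists. It assumes zero-based row indices and a row range that starts at start_row and spans n_rows rows, which is consistent with the outputs above. The function select and the toy data are hypothetical, and the sketch simply ignores start_row and n_rows when idx is given, which is a simplification rather than a statement about read_object().

```python
def select(table, field_mask=None, idx=None, start_row=0, n_rows=None):
    """Pick columns and rows from a non-empty dict of equal-length lists."""
    columns = field_mask if field_mask is not None else list(table)
    length = len(next(iter(table.values())))
    if idx is None:
        # Half-open range: start_row is included, start_row + n_rows is not
        stop = length if n_rows is None else min(start_row + n_rows, length)
        idx = range(start_row, stop)
    return {name: [table[name][i] for i in idx] for name in columns}


# Toy data, not taken from the test file
table = {
    "ievt": list(range(20)),
    "energy": [1000 + 10 * i for i in range(20)],
    "channel": [i % 3 for i in range(20)],
}

by_index = select(table, field_mask=("ievt", "energy"), idx=[1, 3, 7])
print(by_index)  # {'ievt': [1, 3, 7], 'energy': [1010, 1030, 1070]}

by_range = select(table, field_mask=("ievt",), start_row=15, n_rows=3)
print(by_range)  # {'ievt': [15, 16, 17]}
```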

As you might have noticed, read_object() loads all the requested data into memory at once. This can be a problem when dealing with large datasets. LH5Iterator [docs] makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:

[10]:
from pygama.lgdo import LH5Iterator

for lh5_obj, entry, n_rows in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"entry {entry}, energy = {lh5_obj} ({n_rows} rows)")
entry 0, energy = [3304 8642 9177 ... 8289 7091 4084] (20 rows)
entry 20, energy = [11546  2873  3193 ...  4114 29557 12309] (20 rows)
entry 40, energy = [ 6455  3302  5314 ... 37333  4262 15131] (20 rows)
entry 60, energy = [ 6117  3358  3132 ...  2949  3691 10402] (20 rows)
entry 80, energy = [ 4088 41153 34295 ... 37877  6014  3410] (20 rows)
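
The chunking pattern itself can be sketched in a few lines of plain Python. The generator below is only an illustration of the idea on an in-memory list; iter_chunks is a hypothetical name, and the real LH5Iterator reads from disk rather than slicing a list.

```python
def iter_chunks(data, buffer_len):
    """Yield (chunk, entry, n_rows) for consecutive slices of data."""
    for entry in range(0, len(data), buffer_len):
        chunk = data[entry : entry + buffer_len]
        yield chunk, entry, len(chunk)


energies = list(range(100))  # stand-in for a 100-row column
for chunk, entry, n_rows in iter_chunks(energies, buffer_len=20):
    print(f"entry {entry}: {n_rows} rows, first value {chunk[0]}")
```

In this sketch the n_rows value matters for the final chunk, which is shorter than buffer_len whenever the total number of rows is not a multiple of it.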

Writing data to disk

Let’s start by creating some LGDOs:

[11]:
from pygama.lgdo import Array, Scalar, WaveformTable
import numpy as np

rng = np.random.default_rng(12345)

scalar = Scalar("made with pygama!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))
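
A WaveformTable groups three columns, t0, dt and values, as the table{t0,dt,values} signature shown earlier indicates. Assuming t0 is the time of the first sample and dt the sampling period, the time of each sample in a waveform follows directly. The helper sample_times below is a hypothetical plain-Python sketch, not part of pygama.

```python
def sample_times(t0, dt, n_samples):
    """Return the time of each sample: t0 + i * dt for i = 0 .. n_samples - 1."""
    return [t0 + i * dt for i in range(n_samples)]


# t0 and dt as they appear in the test file read earlier (both in ns)
times = sample_times(t0=0.0, dt=16.0, n_samples=5)
print(times)  # [0.0, 16.0, 32.0, 48.0, 64.0]
```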

The write_object() [docs] method of LH5Store makes it possible to write LGDO objects to disk. Let’s start by writing scalar with name message in a file named my_objects.lh5 in the current directory:

[12]:
store = LH5Store()

store.write_object(
    scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file"
)

Let’s now inspect the file contents:

[13]:
from pygama.lgdo import show

show("my_objects.lh5")
/
└── message · string

The string object has been written at the root of the file (/). Let’s now also write array and wf_table, this time in an HDF5 group called closet:

[14]:
store.write_object(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
store.write_object(
    wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5"
)
show("my_objects.lh5")
/
├── closet · HDF5 group
│   ├── numbers · array<1>{real}
│   └── waveforms · table{t0,dt,values}
│       ├── dt · array<1>{real}
│       ├── t0 · array<1>{real}
│       └── values · array_of_equalsized_arrays<1,1>{real}
└── message · string

Everything looks right!

Note: write_object() allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the [docs] for more information.


This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the pygama repository.