Handling LH5 data
LEGEND stores its data in HDF5 format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented here.
Reading data from disk
Let’s start by downloading a small test LH5 file with the legend-testdata package (this may take a while, depending on your internet connection):
[1]:
from legend_testdata import LegendTestData
ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")
We can use pygama.lgdo.lh5_store.ls() to inspect the file contents:
[2]:
from pygama.lgdo import ls
ls(lh5_file)
[2]:
['geds']
This particular file contains a single HDF5 group (groups behave like directories). The second argument of ls() can be used to inspect a group. With a trailing /, the contents of the group are listed; without it, only the group name itself is returned, if it exists:
[3]:
ls(lh5_file, "geds/") # returns ['geds/raw'], which is a group again
ls(lh5_file, "geds/raw/")
[3]:
['geds/raw/baseline',
'geds/raw/channel',
'geds/raw/energy',
'geds/raw/ievt',
'geds/raw/numtraces',
'geds/raw/packet_id',
'geds/raw/timestamp',
'geds/raw/tracelist',
'geds/raw/waveform',
'geds/raw/wf_max',
'geds/raw/wf_std']
Note: As an alternative to ls(), show() prints a tree representation of the LH5 file contents (with LGDO types) on screen:
[4]:
from pygama.lgdo import show
show(lh5_file)
/
└── geds · HDF5 group
    └── raw · table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}
        ├── baseline · array<1>{real}
        ├── channel · array<1>{real}
        ├── energy · array<1>{real}
        ├── ievt · array<1>{real}
        ├── numtraces · array<1>{real}
        ├── packet_id · array<1>{real}
        ├── timestamp · array<1>{real}
        ├── tracelist · array<1>{array<1>{real}}
        │   ├── cumulative_length · array<1>{real}
        │   └── flattened_data · array<1>{real}
        ├── waveform · table{t0,dt,values}
        │   ├── dt · array<1>{real}
        │   ├── t0 · array<1>{real}
        │   └── values · array_of_equalsized_arrays<1,1>{real}
        ├── wf_max · array<1>{real}
        └── wf_std · array<1>{real}
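The tracelist entry in the tree above is a vector of vectors: rows of variable length stored as two flat arrays. The plain-Python sketch below shows how such a layout can be unpacked. It assumes that cumulative_length[i] holds the end offset of row i within flattened_data, which is consistent with the read_object() output shown on this page. The unflatten helper and the numbers are hypothetical and are not part of pygama.

```python
# Plain-Python sketch of the flattened_data / cumulative_length layout.
# The values are made up for illustration; they do not come from the test file.
flattened_data = [53, 60, 61, 30, 31, 32]
cumulative_length = [1, 3, 6]  # assumed: exclusive end offset of each row


def unflatten(flattened_data, cumulative_length):
    """Rebuild the list of variable-length rows (hypothetical helper)."""
    rows = []
    start = 0
    for stop in cumulative_length:
        rows.append(flattened_data[start:stop])
        start = stop  # the next row begins where this one ended
    return rows


vectors = unflatten(flattened_data, cumulative_length)
print(vectors)  # [[53], [60, 61], [30, 31, 32]]
```

With this layout, the last element of cumulative_length equals the total number of stored values, and the length of cumulative_length equals the number of rows.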
The group contains several LGDOs. Let’s read them into memory. We start by initializing an LH5Store object:
[5]:
from pygama.lgdo import LH5Store
store = LH5Store()
read_object() reads an LGDO from disk and returns a tuple holding the in-memory object and the number of rows read, if the object has such a property. Let’s try to read geds/raw:
[6]:
store.read_object("geds/raw", lh5_file)
[6]:
(Table(dict={'packet_id': Array([ 1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([ 0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=Array([ 1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [ 9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{t0,dt,values}'})}, attrs={'datatype': 'table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}'}),
100)
As shown by the type signature, it is interpreted as a Table with 100 rows. Its contents (or “columns”) can therefore be viewed as LGDO objects of the same length. For example, timestamp:
[7]:
obj, n_rows = store.read_object("geds/raw/timestamp", lh5_file)
obj
[7]:
Array([0.79465985 0.7968994 0.79960424 ... 0.97331905 0.974689
0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'})
The result is an LGDO Array with 100 elements.
read_object() also supports more advanced data reading. For example, let’s read only 10 rows, starting from row 15:
[8]:
obj, n_rows = store.read_object("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)
[0.82679445 0.8307392 0.8298773 0.830739 0.8339691 0.83487684
0.83510256 0.83612865 0.83797085 0.8406608 ] with attrs={'units': 's'}
Or, let’s read only the columns timestamp and energy from the geds/raw table, restricted to rows [1, 3, 7, 9, 10, 15]:
[9]:
obj, n_rows = store.read_object(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)
timestamp energy
0.796899 8642
0.799604 13015
0.812317 22085
0.813282 26636
0.813520 2648
0.826794 7799
with attrs['timestamp']={'units': 's'}
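To make the two selection styles above concrete, here is a plain-Python sketch that mimics them on a toy table (a dictionary of equal-length lists). The select helper is hypothetical and only illustrates the semantics described on this page: start_row and n_rows pick a contiguous window, idx picks individual rows, and field_mask keeps only the named columns. How read_object() combines idx with start_row is not covered here, so the sketch treats them as alternatives.

```python
# Toy stand-in for a table: column name -> list of values (made-up numbers).
table = {
    "timestamp": [0.10, 0.11, 0.12, 0.13, 0.14, 0.15],
    "energy": [100, 200, 300, 400, 500, 600],
    "channel": [1, 2, 1, 2, 1, 2],
}


def select(table, field_mask=None, idx=None, start_row=0, n_rows=None):
    """Hypothetical helper mimicking the selections shown above."""
    columns = list(table) if field_mask is None else list(field_mask)
    out = {}
    for name in columns:
        col = table[name]
        if idx is not None:
            # Explicit row indices: keep them in the requested order.
            out[name] = [col[i] for i in idx]
        else:
            # Contiguous window: n_rows rows starting at start_row.
            stop = len(col) if n_rows is None else start_row + n_rows
            out[name] = col[start_row:stop]
    return out


window = select(table, start_row=2, n_rows=3)  # rows 2, 3 and 4
picked = select(table, field_mask=("timestamp", "energy"), idx=[1, 3, 5])
print(window["energy"])  # [300, 400, 500]
print(picked)
```

Note that start_row=2 with n_rows=3 covers rows 2 to 4 inclusive, in the same way that start_row=15 with n_rows=10 covers rows 15 to 24.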
As you might have noticed, read_object() loads all the requested data into memory at once. This can be a problem when dealing with large datasets. LH5Iterator makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:
[10]:
from pygama.lgdo import LH5Iterator
for lh5_obj, entry, n_rows in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"entry {entry}, energy = {lh5_obj} ({n_rows} rows)")
entry 0, energy = [3304 8642 9177 ... 8289 7091 4084] (20 rows)
entry 20, energy = [11546 2873 3193 ... 4114 29557 12309] (20 rows)
entry 40, energy = [ 6455 3302 5314 ... 37333 4262 15131] (20 rows)
entry 60, energy = [ 6117 3358 3132 ... 2949 3691 10402] (20 rows)
entry 80, energy = [ 4088 41153 34295 ... 37877 6014 3410] (20 rows)
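The chunked access pattern above can be sketched in plain Python with a generator. This is only an illustration of the idea, not how LH5Iterator is implemented: iter_chunks is a hypothetical helper that works on an in-memory list, whereas the real iterator reads each chunk from disk.

```python
def iter_chunks(values, buffer_len):
    """Yield (chunk, entry, n_rows); entry is the index of the chunk's first row."""
    for entry in range(0, len(values), buffer_len):
        chunk = values[entry : entry + buffer_len]
        yield chunk, entry, len(chunk)


energies = list(range(50))  # made-up data standing in for a column on disk

seen = []
for chunk, entry, n_rows in iter_chunks(energies, buffer_len=20):
    seen.append((entry, n_rows))
    print(f"entry {entry}: {n_rows} rows")
```

In this sketch the last chunk holds only 10 rows because 50 is not a multiple of 20, which is why the number of rows is reported alongside each chunk.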
Writing data to disk
Let’s start by creating some LGDOs:
[11]:
from pygama.lgdo import Array, Scalar, WaveformTable
import numpy as np
rng = np.random.default_rng(12345)
scalar = Scalar("made with pygama!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))
The write_object() method of LH5Store makes it possible to write LGDO objects to disk. Let’s start by writing scalar with name message in a file named my_objects.lh5 in the current directory:
[12]:
store = LH5Store()
store.write_object(
    scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file"
)
Let’s now inspect the file contents:
[13]:
from pygama.lgdo import show
show("my_objects.lh5")
/
└── message · string
The string object has been written at the root (/) of the file. Let’s now also write array and wf_table, this time in an HDF5 group called closet:
[14]:
store.write_object(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
store.write_object(
    wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5"
)
show("my_objects.lh5")
/
├── closet · HDF5 group
│   ├── numbers · array<1>{real}
│   └── waveforms · table{t0,dt,values}
│       ├── dt · array<1>{real}
│       ├── t0 · array<1>{real}
│       └── values · array_of_equalsized_arrays<1,1>{real}
└── message · string
Everything looks right!
Note: write_object() allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the documentation for more information.
This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the pygama repository.