
@danlurie
Forked from anonymous/bddb_dilemma.md
Last active December 9, 2016 00:04

The Setup:

I have MRI data from a group of patients with brain damage (due to stroke, head injury, etc.). For each patient, I have a binary MRI image (a 3D array of ~870k voxels), where a value of 1 or 0 indicates the presence or absence of damage at a particular voxel (3D pixel).

The Goal:

Return a list of patients with damage at a given voxel.

Here's the stupid/inelegant way to do this:

voxel_id = 12345 # Index of the voxel I'm interested in.
patient_matches = []
for patient in patient_list:
  patient_img_file = '/path/to/images/{}.MRI'.format(patient) # Get the path to the image for this patient.
  img = load_image(patient_img_file) # Load the image as a NumPy ndarray.
  img_flat = img.ravel() # Flatten to a 1D vector.
  vox_val = img_flat[voxel_id] # Index the flattened array, not the 3D one.
  if vox_val == 1:
    patient_matches.append(patient)
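As an aside, flattening isn't strictly required for this: NumPy's `unravel_index` maps a flat voxel index back to 3D coordinates, so the original 3D array can be indexed directly. A minimal sketch with a tiny synthetic image (the shape and indices are made up for illustration):

```python
import numpy as np

# Synthetic stand-in for one patient's binary damage image.
img = np.zeros((2, 3, 4), dtype=np.uint8)  # 24 voxels total
img[1, 2, 3] = 1                           # mark one voxel as damaged

voxel_id = 23  # Flat (raveled) index of that same voxel.

# Option 1: flatten, then index with the flat voxel ID.
assert img.ravel()[voxel_id] == 1

# Option 2: convert the flat index to 3D coordinates and index directly.
coords = np.unravel_index(voxel_id, img.shape)  # -> (1, 2, 3)
assert img[coords] == 1
```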

The Original Plan

  1. Load and flatten each image, so I have a 1x870k vector for each patient.
  2. Read these vectors into a DB (table='damage'), with 1 row per patient and 1 column per voxel.
  3. Query: `select patient_id from damage where vox12345 = 1`
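With a toy number of voxels, the plan above can be sketched with the stdlib `sqlite3` module (the `vox0`, `vox1`, ... column names and patient data are assumptions for illustration; the real ~870k columns would far exceed SQLite's column limit, which is the problem discussed below):

```python
import sqlite3

n_voxels = 5  # Toy stand-in for ~870k.
patients = {'p01': [0, 1, 0, 0, 1], 'p02': [1, 1, 0, 1, 0]}

conn = sqlite3.connect(':memory:')
cols = ', '.join('vox{} INTEGER'.format(i) for i in range(n_voxels))
conn.execute('CREATE TABLE damage (patient_id TEXT, {})'.format(cols))

# One row per patient, one column per voxel.
for pid, vec in patients.items():
    placeholders = ', '.join('?' * (n_voxels + 1))
    conn.execute('INSERT INTO damage VALUES ({})'.format(placeholders), [pid] + vec)

# Step 3: who has damage at voxel 1?
matches = [row[0] for row in
           conn.execute('SELECT patient_id FROM damage WHERE vox1 = 1')]
# matches -> ['p01', 'p02']
```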

Other Possibilities

I quickly realized that it isn't reasonable (or, in most databases, even possible) to have almost 1 million columns in a table, so my next thought was to have a separate table for each voxel (which would also reduce the amount of data read into memory with each query).

Another solution would be to read all the images into a single huge 2D table (n_patients x n_voxels) and query it using something like Pandas/xarray/PyTables/HDF.
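A minimal sketch of that last idea with pandas, using synthetic data in place of real images (the patient IDs and sizes are made up):

```python
import numpy as np
import pandas as pd

patients = ['p01', 'p02', 'p03']
n_voxels = 100  # Toy stand-in for ~870k voxels per image.

# Each row is one patient's flattened binary damage image.
damage = np.zeros((len(patients), n_voxels), dtype=np.uint8)
damage[0, 42] = 1  # p01 has damage at voxel 42.
damage[2, 42] = 1  # p03 has damage at voxel 42.

df = pd.DataFrame(damage, index=patients)

voxel_id = 42
matches = df.index[df[voxel_id] == 1].tolist()
# matches -> ['p01', 'p03']
```

At full scale the matrix is n_patients rows by ~870k columns; stored as uint8 that is under 1 MB per patient, so a few hundred patients fit comfortably in memory and the same column lookup works unchanged.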
