cassandradl.CassandraDataset class

The only class that the user needs to interact with, in order to use the Cassandra Data Loader, is cassandra_dataset.CassandraDataset.

CassandraDataset.__init__(self, auth_prov, cassandra_ips, port=9042, seed=None)

Create ECVL Dataset from Cassandra DB

Parameters
  • auth_prov – Authenticator for Cassandra

  • cassandra_ips – List of Cassandra IP addresses (or hostnames)

  • port – TCP port to connect to (default: 9042)

  • seed – Seed for random generators

The class must be initialized with the credentials and the hostname for connecting to the Cassandra DB, as in the following example:

from cassandra_dataset import CassandraDataset
from cassandra.auth import PlainTextAuthProvider

## Cassandra connection parameters
ap = PlainTextAuthProvider(username='user', password='pass')
cd = CassandraDataset(ap, ['cassandra-db'])

The next step is initializing a list manager, reading metadata from the DB.

CassandraDataset.init_listmanager(self, table, id_col, label_col='label', label_map=[], grouping_cols=[], num_classes=2, metatable=None)

Initialize the Cassandra list manager.

It takes care of loading/saving the full list of rows from the DB and creating the splits according to the user input.

Parameters
  • table – Metadata by natural keys

  • id_col – Cassandra id column for the images (e.g., ‘patch_id’)

  • label_col – Cassandra label column (default: ‘label’)

  • label_map – Transformation map for labels (e.g., [1,0] inverts the two classes)

  • grouping_cols – Columns to group by (e.g., [‘patient_id’])

  • num_classes – Number of classes (default: 2)

  • metatable – Metadata by uuid patch_id (optional)
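As a rough illustration of the label_map parameter, the sketch below treats the map as a lookup table indexed by the original label. This reading is an assumption based only on the [1,0] example above; the helper remap_labels is hypothetical and not part of the library, and treating an empty map as "leave labels unchanged" is likewise an assumption.

```python
def remap_labels(labels, label_map):
    """Translate labels through label_map (hypothetical helper).

    Assumption: an empty map leaves the labels unchanged.
    """
    if not label_map:
        return list(labels)
    # label_map[old_label] gives the new label.
    return [label_map[lab] for lab in labels]


original = [0, 1, 1, 0]
inverted = remap_labels(original, [1, 0])  # swaps class 0 and class 1
unchanged = remap_labels(original, [])
```

Under this reading, a map such as [0, 1, 1] would merge classes 1 and 2 of a three-class table into a single class.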

CassandraDataset.read_rows_from_db(self, sample_whitelist=None)

Read the full list of rows from the DB.

Parameters

sample_whitelist – Whitelist for group keys

CassandraDataset.init_datatable(self, table, data_col='data')

Setup queries for db table containing raw data

Parameters
  • table – Data table, index by the uuid

  • data_col – Cassandra blob image column (default: ‘data’)

For example:

cd.init_listmanager(
  table='patches.metadata_by_nat',
  id_col='patch_id',
  label_col="label",
  grouping_cols=["patient_id"],
  num_classes=2
)
cd.read_rows_from_db()
cd.init_datatable(
  table='patches.data_by_uuid'
)

The optional parameter grouping_cols specifies the columns by which the images are grouped before the groups are divided among the splits. For example, in digital pathology we typically want all images of the same patient to go into either the training set or the validation set, never both. If grouping_cols is not specified, each image forms a group by itself (i.e., a singleton).
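The idea of keeping groups intact can be sketched in plain Python. This is not the library's algorithm, only a minimal illustration under stated assumptions: rows are dictionaries, the helper split_by_group is hypothetical, and each group is placed greedily in the split that is furthest below its target share.

```python
import random
from collections import defaultdict


def split_by_group(rows, group_key, ratios, seed=0):
    """Assign whole groups of rows to splits (hypothetical helper).

    rows: list of dicts, e.g. {"patch_id": 3, "patient_id": "p1"}
    ratios: relative split sizes, e.g. [7, 2, 1]
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    splits = [[] for _ in ratios]
    total = sum(ratios)
    for key in keys:
        # Place the group in the split furthest below its target share.
        placed = sum(len(s) for s in splits) + len(groups[key])
        deficits = [placed * r / total - len(s) for r, s in zip(ratios, splits)]
        target = deficits.index(max(deficits))
        splits[target].extend(groups[key])
    return splits


rows = [{"patch_id": i, "patient_id": f"p{i % 5}"} for i in range(50)]
train, val, test = split_by_group(rows, "patient_id", [7, 2, 1])
```

Because groups are never broken up, the resulting split sizes only approximate the requested ratios when groups are few or large.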

After the list manager has been initialized and the metadata has been read from the DB, the splits can be created automatically, using the split_setup method.

CassandraDataset.split_setup(self, max_patches=None, split_ratios=None, augs=None, balance=None, seed=None, bags=None)

(Re)Insert the patches in the splits, according to split and class ratios

Parameters
  • max_patches – Number of patches to be read. If None, use all images.

  • split_ratios – Ratio among training, validation and test. If None, use the current value.

  • augs – Data augmentations to be used. If None, use the current ones.

  • balance – Ratio among the different classes. If None, use the current value.

  • seed – Seed for random generators

  • bags – User-provided bags for each split

For example, to create three splits (training, validation and test) with a total of one million patches, in proportions of 70%, 20% and 10% respectively:

cd.split_setup(
  split_ratios=[7,2,1],
  balance=[1,1],
  max_patches=1000000
)
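The arithmetic behind the ratios can be checked with a small sketch. The helper split_sizes is hypothetical, and giving any rounding remainder to the first split is an assumption made for the example, not documented behaviour.

```python
def split_sizes(max_patches, split_ratios):
    """Turn relative split ratios into target patch counts (hypothetical helper)."""
    total = sum(split_ratios)
    sizes = [max_patches * r // total for r in split_ratios]
    # Assumption: any rounding remainder goes to the first split.
    sizes[0] += max_patches - sum(sizes)
    return sizes


three_way = split_sizes(1_000_000, [7, 2, 1])
ten_way = split_sizes(1_000_000, [1] * 10)
```

Only the relative values matter here, so [7, 2, 1] and [70, 20, 10] yield the same targets.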

The balance option asks the split manager to choose images so as to achieve the desired balance among the classes (in this case 1:1). In the example, the algorithm will try to fill the training set with 700,000 images, half of them of class 0 (e.g., normal) and the other half of class 1 (e.g., tumor). If there are not enough images, the loader will choose the largest number that still maintains the desired balance.
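One way to picture the fallback when images are scarce is the sketch below. It is an illustration only, with a hypothetical helper balanced_counts that assumes positive integer class weights; the library's actual selection logic may differ.

```python
def balanced_counts(target, balance, available):
    """Per-class counts that respect a class ratio (hypothetical helper).

    target: desired number of images in the split
    balance: relative class weights (positive integers), e.g. [1, 1]
    available: number of images actually present for each class
    """
    # Largest whole "unit" such that unit * weight fits both the target
    # size and the images available for every class.
    unit = min(
        target // sum(balance),
        min(avail // weight for avail, weight in zip(available, balance)),
    )
    return [unit * weight for weight in balance]


enough = balanced_counts(700_000, [1, 1], [500_000, 400_000])
scarce = balanced_counts(700_000, [1, 1], [500_000, 300_000])
```

In the second call class 1 has only 300,000 images, so both classes are capped at 300,000 to keep the 1:1 ratio.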

Same split ratios, but using all the images in the DB and ignoring the balance among classes:

cd.split_setup(
  split_ratios=[7,2,1],
)

Apply some ECVL augmentations when loading the data:

training_augs = ecvl.SequentialAugmentationContainer(
    [
        ecvl.AugMirror(0.5),
        ecvl.AugFlip(0.5),
        ecvl.AugRotate([-180, 180]),
    ]
)
augs = [training_augs, None, None]
cd.split_setup(
    split_ratios=[7, 2, 1],
    augs=augs,
)

Create 10 splits, using a total of one million patches:

cd.split_setup(
  split_ratios=[1]*10,
  max_patches=1000000
)

To set the batch size and generate only full batches (i.e., the last batch also contains 32 images):

cd.set_batchsize(32, full_batches=True)
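The effect of full_batches on the number of batches can be sketched as follows. The text states only that every batch then holds the full 32 images; the sketch assumes this is achieved by dropping the incomplete final batch, which is a guess, and count_batches is a hypothetical helper.

```python
import math


def count_batches(split_size, batch_size, full_batches=False):
    """Number of batches a split yields (hypothetical helper)."""
    if full_batches:
        # Assumption: the incomplete final batch is dropped.
        return split_size // batch_size
    return math.ceil(split_size / batch_size)


with_partial = count_batches(100, 32)
only_full = count_batches(100, 32, full_batches=True)
```

With 100 images and a batch size of 32, the last batch would otherwise hold only 4 images.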

Once the splits have been created, they can easily be saved (together with all the table information), using the save_splits method and then reloaded with load_splits.

CassandraDataset.save_splits(self, filename)

Save list of split ids.

Parameters

filename – Local filename, as string

CassandraDataset.load_splits(self, filename, augs=None)

Load list of split ids and optionally set batch_size and augmentations.

Parameters
  • filename – Local filename, as string

  • augs – Data augmentations to be used. If None, use the current ones.

For example:

cd.save_splits(
  'splits/1M_3splits.pckl'
)
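The .pckl extension in the example suggests a pickle file, although the on-disk layout is not described here. Purely to illustrate the save-and-reload round trip, the sketch below pickles a hypothetical dictionary of split ids; the field names and file name are invented for the example and do not reflect the library's format.

```python
import os
import pickle
import tempfile

# Hypothetical content: one list of row ids per split, plus table info.
splits = {"split": [[0, 1, 2, 3], [4, 5], [6]], "num_classes": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_splits.pckl")
    with open(path, "wb") as fh:
        pickle.dump(splits, fh)  # save
    with open(path, "rb") as fh:
        restored = pickle.load(fh)  # reload
```

As with any pickle file, it is safest to load only split files that come from a trusted source.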

And, to load an already existing split file:

from cassandra_dataset import CassandraDataset
from cassandra.auth import PlainTextAuthProvider

## Cassandra connection parameters
ap = PlainTextAuthProvider(username='user', password='pass')
cd = CassandraDataset(ap, ['cassandra-db'])
cd.load_splits(
  'splits/1M_3splits.pckl'
)

Once the splits are set up, batches of features and labels can be loaded and passed to a DeepHealth application, as shown in the following example:

epochs = 50
split = 0 # training
cd.set_batchsize(32)
for _ in range(epochs):
    cd.rewind_splits(shuffle=True)
    for _ in range(cd.num_batches[split]):
        x,y = cd.load_batch(split)
        ## feed features and labels to DL engine [...]

CassandraDataset.set_batchsize(self, bs, full_batches=False)

Change dataset batch size

Parameters
  • bs – Batch size when loading data

  • full_batches – Use only full batches

CassandraDataset.rewind_splits(self, chosen_split=None, shuffle=False)

Rewind/reshuffle rows in chosen split and reset its current index

Parameters
  • chosen_split – Split to be rewound. If None, rewind all the splits.

  • shuffle – Apply random permutation (default: False)

CassandraDataset.num_batches = []

Number of batches for each split

CassandraDataset.load_batch(self, split=None)

Read a batch from Cassandra DB.

Parameters

split – Split to read from (defaults to current_split)

Returns

(x,y) with x tensor of features and y tensor of labels
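To dry-run a training loop without a Cassandra cluster, a stand-in object exposing the same method names can be useful. The class below is entirely hypothetical: it mimics set_batchsize, rewind_splits, num_batches and load_batch with in-memory lists, returns plain Python lists rather than tensors, and assumes that full_batches drops the incomplete final batch.

```python
import random


class StubDataset:
    """Hypothetical stand-in exposing the method names used in the loop."""

    def __init__(self, split_sizes, seed=None):
        self._rows = [list(range(n)) for n in split_sizes]
        self._pos = [0] * len(split_sizes)
        self._rng = random.Random(seed)
        self.current_split = 0
        self.num_batches = []
        self.set_batchsize(1)

    def set_batchsize(self, bs, full_batches=False):
        self._bs = bs
        # Assumption: full_batches drops the incomplete final batch.
        self.num_batches = [
            len(r) // bs if full_batches else -(-len(r) // bs)
            for r in self._rows
        ]

    def rewind_splits(self, chosen_split=None, shuffle=False):
        if chosen_split is None:
            targets = range(len(self._rows))
        else:
            targets = [chosen_split]
        for s in targets:
            if shuffle:
                self._rng.shuffle(self._rows[s])
            self._pos[s] = 0  # reset the read position

    def load_batch(self, split=None):
        split = self.current_split if split is None else split
        start = self._pos[split]
        ids = self._rows[split][start:start + self._bs]
        self._pos[split] = start + len(ids)
        # Fake features and labels derived from the row id.
        return [[float(i)] for i in ids], [i % 2 for i in ids]


stub = StubDataset([100, 20, 10], seed=42)
stub.set_batchsize(32)
seen = 0
for _ in range(2):  # epochs
    stub.rewind_splits(shuffle=True)
    for _ in range(stub.num_batches[0]):
        x, y = stub.load_batch(0)
        seen += len(x)
```

Each epoch visits every row of the training split exactly once, which is the behaviour the rewind-then-iterate pattern is meant to provide.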