cassandradl.CassandraDataset class
The only class that the user needs to interact with, in order to use
the Cassandra Data Loader, is cassandra_dataset.CassandraDataset.
- CassandraDataset.__init__(self, auth_prov, cassandra_ips, port=9042, seed=None)
Create ECVL Dataset from Cassandra DB
- Parameters
auth_prov – Authenticator for Cassandra
cassandra_ips – List of Cassandra IP addresses
port – TCP port to connect to (default: 9042)
seed – Seed for random generators
- Returns
- Return type
The class must be initialized with the credentials and the hostname for connecting to the Cassandra DB, as in the following example:
from cassandra_dataset import CassandraDataset
from cassandra.auth import PlainTextAuthProvider
## Cassandra connection parameters
ap = PlainTextAuthProvider(username='user', password='pass')
cd = CassandraDataset(ap, ['cassandra-db'])
The next step is initializing a list manager, reading metadata from the DB.
- CassandraDataset.init_listmanager(self, table, id_col, label_col='label', label_map=[], grouping_cols=[], num_classes=2, metatable=None)
Initialize the Cassandra list manager.
It takes care of loading/saving the full list of rows from the DB and creating the splits according to the user input.
- Parameters
table – Metadata by natural keys
id_col – Cassandra id column for the images (e.g., ‘patch_id’)
label_col – Cassandra label column (default: ‘label’)
label_map – Transformation map for labels (e.g., [1,0] inverts the two classes)
grouping_cols – Columns to group by (e.g., [‘patient_id’])
num_classes – Number of classes (default: 2)
metatable – Metadata by uuid patch_id (optional)
- Returns
- Return type
- CassandraDataset.read_rows_from_db(self, sample_whitelist=None)
Read the full list of rows from the DB.
- Parameters
sample_whitelist – Whitelist for group keys
- Returns
- Return type
- CassandraDataset.init_datatable(self, table, data_col='data')
Set up queries for the DB table containing raw data
- Parameters
table – Data table, indexed by the uuid
data_col – Cassandra blob image column (default: ‘data’)
- Returns
- Return type
For example:
cd.init_listmanager(
    table='patches.metadata_by_nat',
    id_col='patch_id',
    label_col="label",
    grouping_cols=["patient_id"],
    num_classes=2
)
cd.read_rows_from_db()
cd.init_datatable(
    table='patches.data_by_uuid'
)
The optional parameter grouping_cols specifies the columns by
which the images should be grouped before the groups are divided among
the splits. For example, in digital pathology contexts we typically
want all images of the same patient to go either into the training
or the validation set. If grouping_cols is not specified, then
each image forms a group by itself (i.e., a singleton).
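The grouping idea can be sketched in plain Python, without a database. This is only an illustration of the concept, not the loader's actual algorithm: the rows, the helper names group_rows and assign_groups, and the round-robin assignment (which ignores split ratios) are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical metadata rows: (patient_id, patch_id, label).
rows = [
    ("p1", "a", 0), ("p1", "b", 0),
    ("p2", "c", 1),
    ("p3", "d", 0), ("p3", "e", 1),
    ("p4", "f", 1),
]

def group_rows(rows, use_grouping=True):
    """Group rows by patient_id, or make each row its own group."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # Without grouping columns, the row index is the key: one group per row.
        key = row[0] if use_grouping else i
        groups[key].append(row)
    return list(groups.values())

def assign_groups(groups, n_splits, seed=0):
    """Shuffle whole groups and deal them round-robin to the splits."""
    rng = random.Random(seed)
    shuffled = groups[:]
    rng.shuffle(shuffled)
    splits = [[] for _ in range(n_splits)]
    for i, group in enumerate(shuffled):
        # A group is never divided, so a patient stays in a single split.
        splits[i % n_splits].extend(group)
    return splits

splits = assign_groups(group_rows(rows), n_splits=2)
patients_per_split = [{r[0] for r in s} for s in splits]
```

Because whole groups are assigned, the sets in patients_per_split never share a patient, which is the property the grouping is meant to guarantee.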
After the list manager has been initialized and the metadata has been
read from the DB, the splits can be created automatically, using the
split_setup method.
- CassandraDataset.split_setup(self, max_patches=None, split_ratios=None, augs=None, balance=None, seed=None, bags=None)
(Re)Insert the patches in the splits, according to split and class ratios
- Parameters
max_patches – Number of patches to be read. If None use all images.
split_ratios – Ratio among training, validation and test. If None use the current value.
augs – Data augmentations to be used. If None use the current ones.
balance – Ratio among the different classes. If None use the current value.
seed – Seed for random generators
bags – User-provided bags for each split
- Returns
- Return type
For example, creating three splits (training, validation and test), with a total of one million patches and proportions respectively 70%, 20% and 10%:
cd.split_setup(
    split_ratios=[7,2,1],
    balance=[1,1],
    max_patches=1000000
)
The balance option asks the split manager to choose images so as
to achieve a desired balance among the classes (in this case 1:1).
In the example, the algorithm will try to fill the training
set with 700,000 images, half of them of class 0 (e.g., normal) and
the other half of class 1 (e.g., tumor). If there are not enough images, the loader
will choose the largest number of images that still maintains the desired balance.
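The arithmetic behind this behaviour can be sketched as follows. This is a simplified model under stated assumptions, not the loader's implementation: the function per_class_targets is hypothetical, it looks at a single split in isolation, and the availability figures are made up for illustration.

```python
def per_class_targets(split_size, balance, available):
    """Return the number of images to take per class for one split.

    split_size: desired total number of images in the split
    balance:    relative class weights, e.g. [1, 1]
    available:  number of images actually available per class
    """
    total_weight = sum(balance)
    # Largest total that every class can still supply at the requested ratio.
    feasible = min(a * total_weight // w for a, w in zip(available, balance))
    total = min(split_size, feasible)
    return [total * w // total_weight for w in balance]

# Enough images of both classes: the 700,000 target is met, half per class.
plenty = per_class_targets(700_000, [1, 1], [900_000, 800_000])

# Only 200,000 images of class 1: the split shrinks to keep the 1:1 balance.
scarce = per_class_targets(700_000, [1, 1], [900_000, 200_000])
```

In the second call the scarce class caps the split at 400,000 images, since taking more of class 0 would break the requested ratio.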
To keep the same split ratios while using all the images in the DB and ignoring the balance among classes:
cd.split_setup(
    split_ratios=[7,2,1],
)
Apply some ECVL augmentations when loading the data:
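import pyecvl.ecvl as ecvl  # ECVL Python bindings (PyECVL)

# Augmentations applied to the training split only
training_augs = ecvl.SequentialAugmentationContainer(
    [
        ecvl.AugMirror(0.5),
        ecvl.AugFlip(0.5),
        ecvl.AugRotate([-180, 180]),
    ]
)
# One entry per split: training, validation, test
augs = [training_augs, None, None]
cd.split_setup(
    split_ratios=[7, 2, 1],
    augs=augs,
)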
Create 10 splits, using a total of one million patches:
cd.split_setup(
    split_ratios=[1]*10,
    max_patches=1000000
)
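The ratio lists used in these examples are relative weights, so [7,2,1] and [1]*10 are both normalized by their sum. A minimal sketch of that conversion, assuming a hypothetical helper split_sizes and an arbitrary choice of where the rounding remainder goes (the loader may round differently):

```python
def split_sizes(split_ratios, max_patches):
    """Turn relative split ratios into absolute patch counts."""
    total = sum(split_ratios)
    sizes = [max_patches * r // total for r in split_ratios]
    # Assumption: any rounding remainder is given to the first split,
    # so that the sizes always add up to max_patches.
    sizes[0] += max_patches - sum(sizes)
    return sizes

three_way = split_sizes([7, 2, 1], 1_000_000)
ten_way = split_sizes([1] * 10, 1_000_000)
```

With one million patches, the three-way split yields 700,000, 200,000 and 100,000 patches, and the ten-way split yields 100,000 patches each.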
To set the batch size and generate only full batches (i.e., the last batch also contains 32 images):
cd.set_batchsize(32, full_batches=True)
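The effect of full_batches on the number of batches can be illustrated with a small calculation. The sketch below assumes that a trailing partial batch is simply dropped when full_batches is enabled; the text does not say how the loader achieves full batches, so treat count_batches as a hypothetical helper rather than the library's behaviour.

```python
import math

def count_batches(n_images, bs, full_batches=False):
    """Number of batches produced from n_images with batch size bs."""
    if full_batches:
        # Assumption: the incomplete last batch is discarded.
        return n_images // bs
    # Otherwise the last batch may hold fewer than bs images.
    return math.ceil(n_images / bs)

with_partial = count_batches(100, 32)
only_full = count_batches(100, 32, full_batches=True)
```

For 100 images and a batch size of 32, this gives four batches when a partial batch is allowed (the last one holding 4 images) and three when only full batches are kept.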
Once the splits have been created, they can easily be saved (together
with all the table information), using the save_splits method and
then reloaded with load_splits.
- CassandraDataset.save_splits(self, filename)
Save list of split ids.
- Parameters
filename – Local filename, as string
- Returns
- Return type
- CassandraDataset.load_splits(self, filename, augs=None)
Load list of split ids and optionally set batch_size and augmentations.
- Parameters
filename – Local filename, as string
augs – Data augmentations to be used. If None use the current ones.
- Returns
- Return type
For example:
cd.save_splits(
    'splits/1M_3splits.pckl'
)
And, to load an already existing split file:
from cassandra_dataset import CassandraDataset
from cassandra.auth import PlainTextAuthProvider
## Cassandra connection parameters
ap = PlainTextAuthProvider(username='user', password='pass')
cd = CassandraDataset(ap, ['cassandra-db'])
cd.load_splits(
    'splits/1M_3splits.pckl'
)
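The save and reload round trip can be pictured with Python's standard pickle module. This is a guess based only on the .pckl extension used in the examples: the actual on-disk layout written by save_splits is not described here, and the dictionary below is a made-up stand-in for the real split data.

```python
import os
import pickle
import tempfile

# Hypothetical split content: one list of patch ids per split.
split_ids = {"train": ["a", "b", "c"], "val": ["d"], "test": ["e"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_splits.pckl")
    # Write the split lists to a local file ...
    with open(path, "wb") as f:
        pickle.dump(split_ids, f)
    # ... and read them back, as a later training run would do.
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Since unpickling can execute arbitrary code, files in this format should only be loaded from trusted sources.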
Once the splits are set up, it is finally possible to load batches of features and labels and pass them to a DeepHealth application, as shown in the following example:
epochs = 50
split = 0  # training
cd.set_batchsize(32)
for _ in range(epochs):
    # Reset (and reshuffle) the splits at the start of each epoch
    cd.rewind_splits(shuffle=True)
    for _ in range(cd.num_batches[split]):
        x, y = cd.load_batch(split)
        ## feed features and labels to DL engine [...]
- CassandraDataset.set_batchsize(self, bs, full_batches=False)
Change dataset batch size
- Parameters
bs – Batch size when loading data
full_batches – Use only full batches
- Returns
- Return type
- CassandraDataset.rewind_splits(self, chosen_split=None, shuffle=False)
Rewind/reshuffle rows in chosen split and reset its current index
- Parameters
chosen_split – Split to be rewound. If None rewind all the splits.
shuffle – Apply random permutation (default: False)
- Returns
- Return type
- CassandraDataset.num_batches = []
Number of batches for each split
- CassandraDataset.load_batch(self, split=None)
Read a batch from Cassandra DB.
- Parameters
split – Split to read from (defaults to current_split)
- Returns
(x,y) with x tensor of features and y tensor of labels
- Return type
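The loading calls documented above can be tried without a Cassandra cluster by using a stand-in object. The FakeDataset class below is entirely hypothetical: it only mimics the names rewind_splits, num_batches and load_batch on in-memory lists, and it shows how a training pass and a validation pass might be combined in each epoch.

```python
import random

class FakeDataset:
    """Hypothetical in-memory stand-in exposing the loader's method names."""

    def __init__(self, splits, bs):
        self.splits = [list(s) for s in splits]
        self.bs = bs
        self.pos = [0] * len(self.splits)
        # Ceiling division: the last batch of a split may be partial.
        self.num_batches = [-(-len(s) // bs) for s in self.splits]

    def rewind_splits(self, chosen_split=None, shuffle=False):
        if chosen_split is None:
            targets = range(len(self.splits))
        else:
            targets = [chosen_split]
        for i in targets:
            if shuffle:
                random.shuffle(self.splits[i])
            self.pos[i] = 0  # restart reading from the first row

    def load_batch(self, split=0):
        start = self.pos[split]
        batch = self.splits[split][start:start + self.bs]
        self.pos[split] = start + len(batch)
        xs = [x for x, _ in batch]
        ys = [y for _, y in batch]
        return xs, ys

# Made-up (feature, label) pairs for a training and a validation split.
train = [(i, i % 2) for i in range(10)]
val = [(i, i % 2) for i in range(10, 14)]
fd = FakeDataset([train, val], bs=4)

seen = []
for epoch in range(2):
    fd.rewind_splits(shuffle=True)
    for split in (0, 1):  # training, then validation
        for _ in range(fd.num_batches[split]):
            x, y = fd.load_batch(split)
            seen.append((epoch, split, len(x)))
```

Rewinding at the start of every epoch is what lets the inner loops read each split from the beginning again; without it, the stand-in would return empty batches from the second epoch onward.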