
This page covers the use of fuel, a library for easy loading of data sets for machine learning and neural network applications.

We begin with #Basic Usage and the core classes of fuel.

Then we move on to #Advanced Usage and how to make the most of fuel in a practical way.

Then we cover #Workflows and how to use fuel in machine learning pipelines.

Basic Usage

We begin with an overview of the basic types of classes in fuel:

Datasets

Datasets are the principal interface to the data. They are stateless objects; to iterate over them, they are typically wrapped in a DataStream object, which creates and requests iterators.

Link to Dataset API documentation: https://fuel.readthedocs.io/en/latest/api/dataset.html?highlight=Dataset

IterableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

Suppose we create eight (8) different 2x2 greyscale images and put them in the variable "features", then assign each image one of four (4) target classes and put those in "targets":

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))

Now we can create a Dataset to iterate over the data:

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

and we can access each attribute using the dataset object:

In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').

In [10]: print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').

In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).

In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.

In [14]: from pprint import pprint

In [15]: pprint(dir(dataset))
[ ...snip...
 'apply_default_transformers',
 'axis_labels',
 'close',
 'default_transformers',
 'example_iteration_scheme',
 'filter_sources',
 'get_data',
 'get_example_stream',
 'iterables',
 'next_epoch',
 'num_examples',
 'open',
 'provides_sources',
 'reset',
 'sources']

Note that the dataset is stateless, so we need to create an external object to represent the state, then pass that into the dataset when we want to iterate over/access the data:

In [17]: state = dataset.open()

In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
(array([[204, 116],
       [152, 249]]), array([3]))
(array([[143, 177],
       [ 23, 233]]), array([0]))
(array([[154,  30],
       [171, 158]]), array([1]))
(array([[236, 124],
       [ 26, 118]]), array([2]))
(array([[186, 120],
       [112, 220]]), array([2]))
(array([[ 69,  80],
       [201, 127]]), array([2]))
(array([[246, 254],
       [175,  50]]), array([3]))
Iterator finished

To reset the state, use the Dataset object's reset() method. To finish, use the close() method.

In [19]: state = dataset.reset(state=state)

In [20]: print(dataset.get_data(state=state))
(array([[ 47, 211],
       [ 38,  53]]), array([0]))

In [21]: dataset.close(state=state)

IndexableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

IndexableDataset objects do not work the same way as IterableDataset objects - there is no need to store a persistent state because all the data can be accessed randomly, in any order you please.


In [1]: from fuel.datasets import IndexableDataset
   ...: from collections import OrderedDict

In [2]: import numpy
   ...: seed = 1234
   ...: rng = numpy.random.RandomState(seed)

In [3]: features = rng.randint(256, size=(8, 2, 2))
   ...: targets = rng.randint(4, size=(8, 1))

In [4]: dataset = IndexableDataset(
   ...:     indexables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

In [5]: state = dataset.open()

In [6]: print("State is {}".format(state))
   ...: print("NOTE: None state returned, because there is no state to maintain!")

State is None
NOTE: None state returned, because there is no state to maintain!

In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
(array([[[154,  30],
        [171, 158]],

       [[204, 116],
        [152, 249]],

       [[ 47, 211],
        [ 38,  53]]]), array([[1],
       [3],
       [0]]))

In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
(array([[[204, 116],
        [152, 249]],

       [[143, 177],
        [ 23, 233]],

       [[236, 124],
        [ 26, 118]],

       [[246, 254],
        [175,  50]]]), array([[3],
       [0],
       [2],
       [3]]))

In [9]: dataset.close(state=state)

No need to reset any iterator.


Note the main difference between the constructor arguments: IndexableDataset requires an indexables dict, while IterableDataset requires an iterables dict:

dataset = IndexableDataset(
     indexables=OrderedDict([('features', features), ('targets', targets)]),
     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                              ('targets', ('batch', 'index'))]))

dataset = IterableDataset(
            iterables=OrderedDict([('features', features), ('targets', targets)]),
            axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

Iteration Schemes

Iteration schemes describe how to walk through a data set: they generate requests (slices or lists of indices) that are used to assemble examples or batches.

Here are the available options (a short sketch contrasting batch-wise and example-wise schemes follows the list):

For generating batches of examples:

  • BatchScheme - useful for returning slices or indices in batches; examples (particular indices to use) can be specified as a list, or as range(N); batch size can be specified using an integer
  • BatchSizeScheme - similar to BatchScheme, but useful for "infinite" data sets where we don't want to provide particular indices of particular examples
  • ConstantScheme - returns a batch of constant size each time (either a specified number of times, or indefinitely)

For generating one example at a time:

  • IndexScheme - iteration scheme to return single indices only (similar to BatchScheme, this supports passing particular indices, but only returns one example at a time)

For generating sequential examples:

  • SequentialExampleScheme - iterate through examples in sequential order
  • SequentialScheme - iterate through examples in sequential order, in batches of a given size

For generating examples in shuffled order:

  • ShuffledExampleScheme - generates single examples, one at a time, in shuffled order
  • ShuffledScheme - generates shuffled batches (creates a shuffled list of indices in memory, which can be memory-intensive and slow for data sets with millions of elements)

For more complicated schemes or combinations of schemes:

  • ConcatenatedScheme - iterator that concatenates multiple iterator schemes
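
To make the batch-wise versus example-wise distinction concrete, here is a minimal sketch comparing two of the sequential schemes (the request values in the comments are what we'd expect for 8 examples, shown for illustration):

from fuel.schemes import SequentialScheme, SequentialExampleScheme

# Batch-wise scheme: each request is a list of indices.
batch_scheme = SequentialScheme(examples=8, batch_size=3)
print(list(batch_scheme.get_request_iterator()))
# e.g. [[0, 1, 2], [3, 4, 5], [6, 7]]

# Example-wise scheme: each request is a single index.
example_scheme = SequentialExampleScheme(examples=8)
print(list(example_scheme.get_request_iterator()))
# e.g. [0, 1, 2, 3, 4, 5, 6, 7]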


ShuffledScheme Example

Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.

Incorrect Usage

Suppose we created an IterableDataset, as in the first example, and tried to iterate over it in arbitrary order:

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

The problem with doing this is that the get_data() method of an IterableDataset does not support any extra arguments (like request), so we can't request data out of the standard iteration order. What happens if we try? We get a ValueError...

In [23]: from fuel.schemes import ShuffledScheme

In [24]: state = dataset.open()

In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

In [26]: for request in scheme.get_request_iterator():
    ...:     data = dataset.get_data(state=state, request=request)
    ...:     print(data[0].shape, data[1].shape)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-24827dafdaa8> in <module>()
      1 for request in scheme.get_request_iterator():
----> 2     data = dataset.get_data(state=state, request=request)
      3     print(data[0].shape, data[1].shape)
      4

/usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
    310     def get_data(self, state=None, request=None):
    311         if state is None or request is not None:
--> 312             raise ValueError
    313         return next(state)
    314

ValueError:

Correct Usage

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

If we instead create our data set using an IndexableDataset object, requests like this are supported and everything goes smoothly.

from fuel.datasets import IndexableDataset
from fuel.schemes import ShuffledScheme
from collections import OrderedDict

import numpy
seed = 1234
rng = numpy.random.RandomState(seed)

# Make some fake data
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

# Make a Dataset - in particular, an IndexableDataset
dataset = IndexableDataset(
            indexables=OrderedDict([('features', features), ('targets', targets)]),
            axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

state = dataset.open()
scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

# Use get_request_iterator() to generate requests
# in shuffled order using the ShuffledScheme.

for request in scheme.get_request_iterator():
    print(request)

print("\n")

for request in scheme.get_request_iterator():
    data = dataset.get_data(state=state, request=request)
    print(data[0].shape, data[1].shape)

Here is the corresponding output:

$ py scheme_shuffled_example.py
[7, 2, 1, 6]
[0, 4, 3, 5]


(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)

Note that the first two lines of output are the requests returned by get_request_iterator(): we asked the scheme for batches of 4 indices using batch_size=4, and the batch axis is the first of the three dimensions of the entire (8, 2, 2) set of "fake" data.

scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

This means each request grabs 4 chunks of data, each of shape (2, 2). Sure enough, in the last two lines of output we see the shapes of the data being returned. Let's examine what that data actually contains. If instead of printing shapes we print data[0], we see the actual data from the "fake" grayscale images (INPUTS):

[[[143 177]
  [ 23 233]]

 [[154  30]
  [171 158]]

 [[236 124]
  [ 26 118]]

 [[246 254]
  [175  50]]]

--- --- --- --- --- --- ---

[[[204 116]
  [152 249]]

 [[ 69  80]
  [201 127]]

 [[ 47 211]
  [ 38  53]]

 [[186 120]
  [112 220]]]

Now, if we print data[1], we see which of the four target classes (0 through 3) each image belongs to (OUTPUTS):

[[0]
 [1]
 [2]
 [3]]

--- --- --- --- --- --- ---

[[3]
 [2]
 [0]
 [2]]

Data Streams

A DataStream object is an iterable stream of examples/minibatches.

The DataStream class is the one you will use most often. Import it with from fuel.streams import DataStream.

When constructing the DataStream, you pass it your Dataset object, and optionally your iteration_scheme object.

DataStream Documentation Links

Link to DataStream documentation: https://fuel.readthedocs.io/en/latest/api/data_streams.html

AbstractDataStream documentation: https://fuel.readthedocs.io/en/latest/api/data_streams.html#module-fuel.streams

DataStream documentation: https://fuel.readthedocs.io/en/latest/api/data_streams.html#fuel.streams.DataStream

DataStream Example

Use the same technique as before to generate fake data. Create an IndexableDataset, and create a scheme for iterating through the data (a ShuffledScheme here).

from fuel.datasets import IndexableDataset
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream
from collections import OrderedDict

import numpy
seed = 1234
rng = numpy.random.RandomState(seed)
n = 32 

# Make some fake data
features = rng.randint(256, size=(n, 2, 2))
targets = rng.randint(4, size=(n, 1))

# Make a Dataset - in particular, an IndexableDataset
dataset = IndexableDataset(
            indexables=OrderedDict([('features', features), ('targets', targets)]),
            axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

Now that we have a data set and a scheme, we can initialize the stream with these:

ds = DataStream(dataset, iteration_scheme=scheme)

There are a couple of ways to iterate through the data stream. The easiest is to use get_epoch_iterator(), which yields each batch of input/output values:

for data in ds.get_epoch_iterator():
    print('------------')
    print(data[0])
    print(data[1])

The output:

------------
[[[246 254]
  [175  50]]

 [[143 177]
  [ 23 233]]

 [[204 116]
  [152 249]]

 [[ 69  80]
  [201 127]]]
[[3]
 [0]
 [3]
 [2]]
------------
[[[ 47 211]
  [ 38  53]]

 [[236 124]
  [ 26 118]]

 [[154  30]
  [171 158]]

 [[186 120]
  [112 220]]]
[[0]
 [2]
 [1]
 [2]]

However, there are other ways to iterate through as well. If we want to access each source by name (recall that for our example the sources are "features" and "targets"), we can pass the argument as_dict=True and get values keyed by their source name:

for d in ds.get_epoch_iterator(as_dict=True):
    print('------------')
    print('features:')
    print(d['features'])
    print("\n")
    print('targets:')
    print(d['targets'])
    print("\n")

This would result in the following output:

------------
features:
[[[143 177]
  [ 23 233]]

 [[154  30]
  [171 158]]

 [[236 124]
  [ 26 118]]

 [[246 254]
  [175  50]]]


targets:
[[0]
 [1]
 [0]
 [1]]


------------
features:
[[[204 116]
  [152 249]]

 [[ 69  80]
  [201 127]]

 [[ 47 211]
  [ 38  53]]

 [[186 120]
  [112 220]]]


targets:
[[1]
 [0]
 [0]
 [0]]

DataStream Example: Higher Dimensional Dataset

Suppose we are working on a classification task, and we have a set of images of three kinds of animals (dogs, cats, and snakes). We want to pick one of each animal and determine whether those three animals could be friends - so our response is a yes/no (a binary 0 or 1).

We can create a Dataset object that has multiple sources and axes. In our case we want three inputs (we'll stick to our 2x2 fake grayscale images) and one output column. When we create the IndexableDataset, we pass "dogs", "cats", "snakes", and "targets" as the keys of both the indexables dict and the axis_labels dict. (For simplicity, the fake targets below reuse rng.randint(4, ...) from the earlier examples, so the response column actually contains values 0 through 3; a strict yes/no response would use rng.randint(2, ...).) Here's what that would look like:

from fuel.datasets import IndexableDataset
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream
from collections import OrderedDict

import numpy
seed = 1234
rng = numpy.random.RandomState(seed)
n = 16 

# Make some fake data
dogs    = rng.randint(256, size=(n, 2, 2))
cats    = rng.randint(256, size=(n, 2, 2))
snakes  = rng.randint(256, size=(n, 2, 2))
targets = rng.randint(4, size=(n, 1))

# Make a Dataset - in particular, an IndexableDataset
dataset = IndexableDataset(
            indexables=OrderedDict([('dogs', dogs), ('cats', cats), ('snakes', snakes), ('targets', targets)]),
            axis_labels=OrderedDict([('dogs', ('batch', 'height', 'width')),
                                     ('cats', ('batch', 'height', 'width')),
                                     ('snakes', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

Now we can do the same thing as before: create an iteration scheme and pass it to our DataStream object. Thanks to the ability to get an epoch iterator as a dictionary, we have easy access to all three inputs by name:

scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

ds = DataStream(dataset, iteration_scheme=scheme)

for data in ds.get_epoch_iterator():
    print('------------')
    print(len(data))

for d in ds.get_epoch_iterator(as_dict=True):
    print('------------')
    for k in d.keys():
        print(k)
        print(d[k])
        print("\n")

The resulting output of the second loop (the first loop simply prints the number of sources, 4, for each batch):

------------
dogs
[[[ 47 211]
  [ 38  53]]

 [[ 69  80]
  [201 127]]

 [[204 116]
  [152 249]]

 [[246 254]
  [175  50]]]


cats
[[[ 81  87]
  [ 13 116]]

 [[ 86 172]
  [218 211]]

 [[ 96 140]
  [197 253]]

 [[ 47 177]
  [ 18  85]]]


snakes
[[[105   0]
  [121  98]]

 [[  1 142]
  [  3  30]]

 [[249  90]
  [161 114]]

 [[140 245]
  [201 109]]]


targets
[[1]
 [3]
 [2]
 [2]]


------------
dogs
[[[ 75  80]
  [  3   2]]

 [[ 19 140]
  [193 203]]

 [[231 139]
  [128 233]]

 [[234 107]
  [174 156]]]


cats
[[[244 118]
  [175 143]]

 [[ 34  10]
  [ 28   4]]

 [[133 238]
  [ 47 246]]

 [[249  62]
  [183  84]]]


snakes
[[[241 152]
  [222 233]]

 [[ 15  72]
  [130 144]]

 [[184 212]
  [136 172]]

 [[183  36]
  [ 88 161]]]


targets
[[1]
 [0]
 [1]
 [3]]


------------
dogs
[[[143 177]
  [ 23 233]]

 [[186 120]
  [112 220]]

 [[ 14 243]
  [199  60]]

 [[236 124]
  [ 26 118]]]


cats
[[[113 223]
  [229 159]]

 [[184 236]
  [ 70 184]]

 [[235  78]
  [151 178]]

 [[ 45  16]
  [ 41  72]]]


snakes
[[[121 241]
  [ 21 199]]

 [[116 105]
  [114 169]]

 [[195  46]
  [226  57]]

 [[250 180]
  [192 213]]]


targets
[[2]
 [0]
 [1]
 [1]]


Transformers

So far we have covered:

  • Dataset objects, which are stateless objects storing the data and the labels for the data axes
  • IndexableDataset objects, which support random access and therefore arbitrary iteration schemes (more flexible)
  • IterableDataset objects, which only support sequential iteration (less flexible)
  • Scheme objects, which generate requests for iterating through data sets according to specifications we set (batch size, order, etc.)
  • Streams, which combine datasets and schemes to allow normal-looking iteration through a data set

One last thing we have not covered is how to apply transformations to our data as we load/access it. That's where Transformer objects come in.
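
As a preview, here is a minimal sketch using fuel's built-in ScaleAndShift transformer, assuming the ds DataStream built in the DataStream Example above. It rescales the 0-255 grayscale features to the range [0, 1] (the transformer computes data * scale + shift on the sources you name):

from fuel.transformers import ScaleAndShift

# Wrap the existing stream; only the 'features' source is rescaled,
# while the 'targets' source passes through untouched.
scaled_ds = ScaleAndShift(ds, scale=1/255.0, shift=0.0,
                          which_sources=('features',))

for d in scaled_ds.get_epoch_iterator(as_dict=True):
    print(d['features'])  # floats in [0.0, 1.0] instead of ints in [0, 255]
    print(d['targets'])   # unchanged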
