|
|
| Now you can install Fuel. | | Now you can install Fuel. |
|
| |
|
| ==Install== | | ==Install Fuel from Source== |
|
| |
|
| <pre> | | <pre> |
| $ git clone git@github.com:/mila-udem/fuel.git | | $ git clone git@github.com:/mila-udem/fuel.git |
| $ cd fuel | | $ cd fuel |
| $ python setup.py build && python setup.py install | | $ python setup.py build |
| | $ python setup.py install |
| </pre> | | </pre> |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| =Basic Usage= | | =Basic Usage= |
|
| |
|
| See [[Fuel/Usage]]
| | {{Main|Fuel/Usage}} |
| | |
| | |
| ==Datasets==
| |
| | |
| Datasets are the principal interface to data. Internally, they use a DataStream object to create and request iterators.
| |
| | |
| ===IterableDataset Example===
| |
| | |
| Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781
| |
| | |
Suppose we create eight 2x2 greyscale images and store them in the variable "features", then draw a target class (one of four, 0 through 3) for each image and store those in "targets":
| |
| | |
| <pre>
| |
| In [1]: import numpy
| |
| | |
| In [2]: seed = 1234
| |
| | |
| In [3]: rng = numpy.random.RandomState(seed)
| |
| | |
| In [4]: features = rng.randint(256, size=(8, 2, 2))
| |
| | |
| In [5]: targets = rng.randint(4, size=(8, 1))
| |
| </pre>
| |
| | |
| Now we can create a Dataset to iterate over the data:
| |
| | |
| <pre>
| |
| In [6]: from collections import OrderedDict
| |
| | |
| In [7]: from fuel.datasets import IterableDataset
| |
| | |
| In [8]: dataset = IterableDataset(
| |
| ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
| |
| ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ...: ('targets', ('batch', 'index'))]))
| |
| </pre>
| |
| | |
We can then inspect the dataset's attributes through the dataset object:
| |
| | |
| <pre>
| |
| In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
| |
| Provided sources are ('features', 'targets').
| |
| | |
| In [10]: print('Sources are {}.'.format(dataset.sources))
| |
| Sources are ('features', 'targets').
| |
| | |
| In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
| |
| Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).
| |
| | |
| In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
| |
| Dataset contains 8 examples.
| |
| | |
| In [14]: from pprint import pprint
| |
| | |
| In [15]: pprint(dir(dataset))
| |
| [
| |
| | |
| ...snip...
| |
| | |
| 'apply_default_transformers',
| |
| 'axis_labels',
| |
| 'close',
| |
| 'default_transformers',
| |
| 'example_iteration_scheme',
| |
| 'filter_sources',
| |
| 'get_data',
| |
| 'get_example_stream',
| |
| 'iterables',
| |
| 'next_epoch',
| |
| 'num_examples',
| |
| 'open',
| |
| 'provides_sources',
| |
| 'reset',
| |
| 'sources']
| |
| </pre>
| |
| | |
Note that the dataset is stateless: it does not keep track of its own position in the data. Instead, open() returns an external state object, and we pass that state back into the dataset whenever we want to iterate over or access the data:
| |
| | |
<pre>
In [17]: state = dataset.open()

In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
(array([[204, 116],
       [152, 249]]), array([3]))
(array([[143, 177],
       [ 23, 233]]), array([0]))
(array([[154,  30],
       [171, 158]]), array([1]))
(array([[236, 124],
       [ 26, 118]]), array([2]))
(array([[186, 120],
       [112, 220]]), array([2]))
(array([[ 69,  80],
       [201, 127]]), array([2]))
(array([[246, 254],
       [175,  50]]), array([3]))
Iterator finished
</pre>
| |
| | |
| To reset the state, use the Dataset object's reset() function. To finish, use the close() function.
| |
| | |
| <pre>
| |
| In [19]: state = dataset.reset(state=state)
| |
| | |
| In [20]: print(dataset.get_data(state=state))
| |
| (array([[ 47, 211],
| |
| [ 38, 53]]), array([0]))
| |
| | |
| In [21]: dataset.close(state=state)
| |
| </pre>
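The stateless pattern can be sketched without Fuel at all. The class below is a hypothetical stand-in (the name <code>TinyIterableDataset</code> is invented here, and this is not Fuel's implementation); it only assumes that open() hands out an iterator and that get_data() advances it:

```python
class TinyIterableDataset:
    """Hypothetical stand-in for a stateless, iterable-style dataset (not Fuel's code)."""

    def __init__(self, iterables):
        # iterables: dict mapping source name -> sequence of examples
        self.iterables = iterables
        self.sources = tuple(iterables)

    def open(self):
        # The state is a fresh iterator over aligned examples;
        # the dataset itself keeps no reference to it.
        return zip(*(self.iterables[name] for name in self.sources))

    def get_data(self, state=None, request=None):
        # Sequential access only: a state is required and requests are rejected.
        if state is None or request is not None:
            raise ValueError("sequential access only")
        return next(state)

    def reset(self, state):
        # Resetting discards the old iterator and hands back a new one.
        return self.open()

    def close(self, state):
        # Nothing to release for in-memory data.
        pass


dataset = TinyIterableDataset({"features": [10, 20, 30], "targets": [0, 1, 0]})
state = dataset.open()
first = dataset.get_data(state=state)
second = dataset.get_data(state=state)
state = dataset.reset(state=state)
again = dataset.get_data(state=state)   # same example as `first`
dataset.close(state=state)
```

Because the position lives in the state rather than in the dataset, two callers could hold two independent states over the same dataset object.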
| |
| | |
| ===IndexableDataset Example===
| |
| | |
| Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781
| |
| | |
IndexableDataset objects work differently from IterableDataset objects: there is no persistent state to maintain, because the data can be accessed randomly, in any order you please.
| |
| | |
| <pre>
| |
| | |
| In [1]: from fuel.datasets import IndexableDataset
| |
| ...: from collections import OrderedDict
| |
| | |
| In [2]: import numpy
| |
| ...: seed = 1234
| |
| ...: rng = numpy.random.RandomState(seed)
| |
| | |
| In [3]: features = rng.randint(256, size=(8, 2, 2))
| |
| ...: targets = rng.randint(4, size=(8, 1))
| |
| | |
| In [4]: dataset = IndexableDataset(
| |
| ...: indexables=OrderedDict([('features', features), ('targets', targets)]),
| |
| ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ...: ('targets', ('batch', 'index'))]))
| |
| | |
| In [5]: state = dataset.open()
| |
| | |
| In [6]: print("State is {}".format(state))
| |
| ...: print("NOTE: None state returned, because there is no state to maintain!")
| |
| | |
| State is None
| |
| NOTE: None state returned, because there is no state to maintain!
| |
| | |
| In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
| |
| (array([[[154, 30],
| |
| [171, 158]],
| |
| | |
| [[204, 116],
| |
| [152, 249]],
| |
| | |
| [[ 47, 211],
| |
| [ 38, 53]]]), array([[1],
| |
| [3],
| |
| [0]]))
| |
| | |
| In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
| |
| (array([[[204, 116],
| |
| [152, 249]],
| |
| | |
| [[143, 177],
| |
| [ 23, 233]],
| |
| | |
| [[236, 124],
| |
| [ 26, 118]],
| |
| | |
| [[246, 254],
| |
| [175, 50]]]), array([[3],
| |
| [0],
| |
| [2],
| |
| [3]]))
| |
| | |
| In [9]: dataset.close(state=state)
| |
| </pre>
| |
| | |
| No need to reset any iterator.
| |
| | |
| | |
Note the main difference between the constructor arguments: IndexableDataset takes an "indexables" dict, while IterableDataset takes an "iterables" dict:
| |
| | |
| <pre>
| |
| dataset = IndexableDataset(
| |
| indexables=OrderedDict([('features', features), ('targets', targets)]),
| |
| axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ('targets', ('batch', 'index'))]))
| |
| | |
| dataset = IterableDataset(
| |
| iterables=OrderedDict([('features', features), ('targets', targets)]),
| |
| axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ('targets', ('batch', 'index'))]))
| |
| </pre>
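For contrast, random access can be sketched the same way. <code>TinyIndexableDataset</code> is again a hypothetical stand-in, not Fuel's class; it assumes only that a request is a list of example indices and that no state is needed:

```python
class TinyIndexableDataset:
    """Hypothetical stand-in for an indexable-style dataset (not Fuel's code)."""

    def __init__(self, indexables):
        # indexables: dict mapping source name -> indexable sequence of examples
        self.indexables = indexables
        self.sources = tuple(indexables)
        self.num_examples = len(next(iter(indexables.values())))

    def open(self):
        # Random access needs no iterator, so there is no state to hand out.
        return None

    def get_data(self, state=None, request=None):
        # A request is a list of example indices, returned in the order asked for.
        if request is None:
            raise ValueError("a request (list of indices) is required")
        return tuple([self.indexables[name][i] for i in request]
                     for name in self.sources)


dataset = TinyIndexableDataset({"features": ["a", "b", "c", "d"],
                                "targets": [0, 1, 2, 3]})
batch = dataset.get_data(request=[3, 1, 0])
```

Here <code>batch</code> holds one list per source, with the examples in the requested order.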
| |
| | |
| ==Iteration Schemes==
| |
| | |
| ===Iteration Scheme Examples===
| |
| | |
| Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.
| |
| | |
| ====Incorrect Usage====
| |
| | |
| Suppose we created an IterableDataset, as in the first example, and tried to iterate over it in arbitrary order:
| |
|
| |
|
<pre>
In [8]: dataset = IterableDataset(
   ...:     iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))
</pre>
| | Summary: |
| | * [[Fuel/Usage#Datasets|Datasets]] are the principal interface to data, but are abstract classes |
| | * [[Fuel/Usage#IterableDataset Example|IterableDatasets]] (less powerful) allow sequential access to data in specified order only |
| | * [[Fuel/Usage#IndexableDataset Example|IndexableDatasets]] (more powerful) allow random access to data |
| | * [[Fuel/Usage#Iteration Schemes|Schemes]] allow iterating through IndexableDatasets in various orders (batch, sequential, shuffle, etc.) |
| |
| | |
The problem is that the get_data() function of an IterableDataset does not accept a request argument, so we can't ask for data out of the standard iteration order. If we try, we get a ValueError:
| |
| | |
<pre>
In [23]: from fuel.schemes import ShuffledScheme

In [24]: state = dataset.open()

In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

In [26]: for request in scheme.get_request_iterator():
    ...:     data = dataset.get_data(state=state, request=request)
    ...:     print(data[0].shape, data[1].shape)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-24827dafdaa8> in <module>()
      1 for request in scheme.get_request_iterator():
----> 2     data = dataset.get_data(state=state, request=request)
      3     print(data[0].shape, data[1].shape)
      4

/usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
    310     def get_data(self, state=None, request=None):
    311         if state is None or request is not None:
--> 312             raise ValueError
    313         return next(state)
    314

ValueError:
</pre>
| |
| | |
| ====Correct Usage====
| |
| | |
| Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781
| |
| | |
The correct approach is to create the data set as an IndexableDataset object. It accepts requests, so the output of the scheme can be passed straight to get_data() and everything goes smoothly.
| |
| | |
<pre>
from fuel.datasets import IndexableDataset
from fuel.schemes import ShuffledScheme
from collections import OrderedDict

import numpy
seed = 1234
rng = numpy.random.RandomState(seed)

# Make some fake data
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

# Make a Dataset - in particular, an IndexableDataset
dataset = IndexableDataset(
    indexables=OrderedDict([('features', features), ('targets', targets)]),
    axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                             ('targets', ('batch', 'index'))]))

state = dataset.open()
scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

# Use get_request_iterator() to generate requests
# in shuffled order using the ShuffledScheme.

for request in scheme.get_request_iterator():
    print(request)

print("\n")

for request in scheme.get_request_iterator():
    data = dataset.get_data(state=state, request=request)
    print(data[0].shape, data[1].shape)
</pre>
| |
| | |
| Here is the corresponding output:
| |
| | |
| <pre>
| |
| $ py iterator_example.py
| |
| [7, 2, 1, 6]
| |
| [0, 4, 3, 5]
| |
| | |
| | |
| (4, 2, 2) (4, 1)
| |
| (4, 2, 2) (4, 1)
| |
| </pre>
| |
| | |
Note the first two lines of output are the requests that the get_request_iterator() method returned: each one is a list of example indices. We asked the scheme for batches of 4 examples, using batch_size=4, and the batch axis is the first of the three dimensions of the entire (8, 2, 2) data set of "fake" data.
| |
| | |
| <pre>
| |
| scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)
| |
| </pre>
| |
| | |
This means each request grabs 4 examples, each of shape (2, 2). Sure enough, the last two lines of output show the shapes of the data being returned: (4, 2, 2) for the features and (4, 1) for the targets. Let's examine what that data actually contains. If, instead of printing shapes, we print <code>data[0]</code>, we see the actual data from the "fake" greyscale images (INPUTS):
| |
| | |
| <pre>
| |
| [[[143 177]
| |
| [ 23 233]]
| |
| | |
| [[154 30]
| |
| [171 158]]
| |
| | |
| [[236 124]
| |
| [ 26 118]]
| |
| | |
| [[246 254]
| |
| [175 50]]]
| |
| | |
| --- --- --- --- --- --- ---
| |
| | |
| [[[204 116]
| |
| [152 249]]
| |
| | |
| [[ 69 80]
| |
| [201 127]]
| |
| | |
| [[ 47 211]
| |
| [ 38 53]]
| |
| | |
| [[186 120]
| |
| [112 220]]]
| |
| </pre>
| |
| | |
Now, if we print <code>data[1]</code>, we see which of the four target classes (0 through 3) each image belongs to (OUTPUTS):
| |
| | |
| <pre>
| |
| [[0]
| |
| [1]
| |
| [2]
| |
| [3]]
| |
| | |
| --- --- --- --- --- --- ---
| |
| | |
| [[3]
| |
| [2]
| |
| [0]
| |
| [2]]
| |
| </pre>
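The request lists shown above are just shuffled example indices cut into batches. A minimal sketch of that idea, using only the standard library, is below. The function name <code>shuffled_batches</code> and its <code>seed</code> parameter are assumptions made for illustration and are not part of Fuel, and the exact order will differ from what ShuffledScheme produces:

```python
import random


def shuffled_batches(num_examples, batch_size, seed=1234):
    """Yield lists of indices that cover range(num_examples) in shuffled order."""
    indices = list(range(num_examples))
    # Shuffle once, then slice the shuffled list into consecutive batches.
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


requests = list(shuffled_batches(num_examples=8, batch_size=4))
for request in requests:
    print(request)
```

Each yielded list plays the role of the <code>request</code> argument passed to get_data() on an indexable dataset.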
| |
|
| |
|
| =Wrapping Custom Datasets with Fuel= | | =Wrapping Custom Datasets with Fuel= |
|
| |
|
A repo by GitHub user dribnet illustrates how to wrap a new dataset using Fuel: https://github.com/dribnet/lfw_fuel
| | {{Main|Fuel/Custom Datasets}} |
| | |
| Advantages:
| |
| * Only takes one command to download the data and import it into fuel
| |
* After that, it only takes one command to import the library that wraps the data and turn the data into training/testing X and Y
| |
| | |
| Disadvantages:
| |
| * One-size-fits-all; importing data using load_data() can take a REALLY long time, and must be done every time you run the script (not persistent in memory)
| |
| * Complicated to extend
| |
| * Removes some of the nicer options of fuel
| |
| | |
| Here is what the final payoff looks like:
| |
| | |
<pre>
from keras.models import Sequential
from lfw_fuel import lfw

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = lfw.load_data(format="deepfunneled")

# (build the perfect model here)

model.fit(X_train, y_train, show_accuracy=True, validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, show_accuracy=True, verbose=0)
</pre>
| | Basically, the process of wrapping a custom data set with fuel looks like this: |
| | * Specify how the original data should be downloaded, processed, and turned into a fuel data set |
| | * Specify how the fuel data set should be loaded |
| | |
| | The first step - defining how to turn original data into fuel data: |
| | * Create a download wrapper - this tells fuel how to download the original data ("briq" download?) |
| | * Define a way to load a single piece of data (e.g., parameterized by name) and, optionally, paired/related pieces of data (e.g., two related images) |
| | * Write a convert function that extracts all the data and assembles it into an HDF5 file (and removes the original data when finished) |
| | |
| | The second step - specifying how the fuel data set should be loaded: |
| | * Create a fuel Dataset object (inheriting from, e.g., H5PYDataset) |
| | * Define a way for that data to be loaded (example: make a universally-available load_data method in a package specific to your data set, as in lfw_fuel) |
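The split between a one-time convert step and a cheap load step can be sketched with the standard library alone. Everything in this block is an assumption made for illustration: a JSON file stands in for the HDF5 file, and the names <code>convert</code>, <code>load_data</code>, <code>train_fraction</code>, and the toy record layout are hypothetical rather than taken from Fuel or lfw_fuel:

```python
import json
import os
import tempfile


def convert(raw_examples, output_path):
    """Step 1 (hypothetical): assemble raw examples into a single file on disk."""
    features = [example["pixels"] for example in raw_examples]
    targets = [example["label"] for example in raw_examples]
    # JSON is used here only as a stand-in for an HDF5 file.
    with open(output_path, "w") as handle:
        json.dump({"features": features, "targets": targets}, handle)


def load_data(path, train_fraction=0.75):
    """Step 2 (hypothetical): load the converted file and split it into train/test."""
    with open(path) as handle:
        data = json.load(handle)
    split = int(len(data["features"]) * train_fraction)
    train = (data["features"][:split], data["targets"][:split])
    test = (data["features"][split:], data["targets"][split:])
    return train, test


# Toy "original data": four two-pixel images with alternating labels.
raw = [{"pixels": [i, i + 1], "label": i % 2} for i in range(4)]
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.json")
convert(raw, path)
(X_train, y_train), (X_test, y_test) = load_data(path)
```

Keeping the expensive work in the convert step means the load step only has to read one prepared file on each run.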
|
| |
|
| =Flags= | | =Flags= |
|
| |
|
| | | {{FuelFlag}} |
| [[Category:Data Engineering]]
| |
| [[Category:NN]]
| |
| [[Category:ML]]
| |