|
|
| (15 intermediate revisions by the same user not shown) |
| Line 17: |
Line 17: |
| Now you can install Fuel. | | Now you can install Fuel. |
|
| |
|
| ==Install== | | ==Install Fuel from Source== |
|
| |
|
| <pre> | | <pre> |
| $ git clone git@github.com:/mila-udem/fuel.git | | $ git clone git@github.com:/mila-udem/fuel.git |
| $ cd fuel | | $ cd fuel |
| $ python setup.py build && python setup.py install | | $ python setup.py build |
| | $ python setup.py install |
| </pre> | | </pre> |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| =Basic Usage= | | =Basic Usage= |
|
| |
|
| ==Datasets==
| | {{Main|Fuel/Usage}} |
| | |
| Datasets are the principal interface to data. They do not iterate over the data themselves; a DataStream object uses a dataset to create and request iterators.
| |
| | |
| ===IterableDataset Example===
| |
| | |
| Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781
| |
| | |
| Suppose we create eight 2x2 greyscale images and store them in the variable "features", then draw one label per image from 4 target classes and store the labels in "targets":
| |
| | |
| <pre>
| |
| In [1]: import numpy
| |
| | |
| In [2]: seed = 1234
| |
| | |
| In [3]: rng = numpy.random.RandomState(seed)
| |
| | |
| In [4]: features = rng.randint(256, size=(8, 2, 2))
| |
| | |
| In [5]: targets = rng.randint(4, size=(8, 1))
| |
| </pre>
| |
| | |
| Now we can create a Dataset to iterate over the data:
| |
| | |
| <pre>
| |
| In [6]: from collections import OrderedDict
| |
| | |
| In [7]: from fuel.datasets import IterableDataset
| |
| | |
| In [8]: dataset = IterableDataset(
| |
| ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
| |
| ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ...: ('targets', ('batch', 'index'))]))
| |
| </pre>
| |
| | |
| and we can access each attribute using the dataset object:
| |
| | |
| <pre>
| |
| In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
| |
| Provided sources are ('features', 'targets').
| |
| | |
| In [10]: print('Sources are {}.'.format(dataset.sources))
| |
| Sources are ('features', 'targets').
| |
| | |
| In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
| |
| Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).
| |
| | |
| In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
| |
| Dataset contains 8 examples.
| |
| | |
| In [14]: from pprint import pprint
| |
| | |
| In [15]: pprint(dir(dataset))
| |
| [
| |
| | |
| ...snip...
| |
| | |
| 'apply_default_transformers',
| |
| 'axis_labels',
| |
| 'close',
| |
| 'default_transformers',
| |
| 'example_iteration_scheme',
| |
| 'filter_sources',
| |
| 'get_data',
| |
| 'get_example_stream',
| |
| 'iterables',
| |
| 'next_epoch',
| |
| 'num_examples',
| |
| 'open',
| |
| 'provides_sources',
| |
| 'reset',
| |
| 'sources']
| |
| </pre>
| |
| | |
| Note that the dataset is stateless, so we need to create an external object to represent the state, then pass that into the dataset when we want to iterate over/access the data:
| |
| | |
| <pre>
| |
| In [17]: state = dataset.open()
| |
| | |
| In [18]: while True:
| |
| ...: try:
| |
| ...: print(dataset.get_data(state=state))
| |
| ...: except StopIteration:
| |
| ...: print('Iterator finished')
| |
| ...: break
| |
| ...:
| |
| (array([[ 47, 211],
| |
| [ 38, 53]]), array([0]))
| |
| (array([[204, 116],
| |
| [152, 249]]), array([3]))
| |
| (array([[143, 177],
| |
| [ 23, 233]]), array([0]))
| |
| (array([[154, 30],
| |
| [171, 158]]), array([1]))
| |
| (array([[236, 124],
| |
| [ 26, 118]]), array([2]))
| |
| (array([[186, 120],
| |
| [112, 220]]), array([2]))
| |
| (array([[ 69, 80],
| |
| [201, 127]]), array([2]))
| |
| (array([[246, 254],
| |
| [175, 50]]), array([3]))
| |
| Iterator finished
| |
| </pre>
| |
| | |
| To reset the state, use the Dataset object's reset() function. To finish, use the close() function.
| |
| | |
| <pre>
| |
| In [19]: state = dataset.reset(state=state)
| |
| | |
| In [20]: print(dataset.get_data(state=state))
| |
| (array([[ 47, 211],
| |
| [ 38, 53]]), array([0]))
| |
| | |
| In [21]: dataset.close(state=state)
| |
| </pre>
| |
| | |
| ===IndexableDataset Example===
| |
| | |
| IndexableDataset objects work differently: there is no persistent state to maintain, because any example can be accessed at random, in any order you please.
| |
| | |
| <pre>
| |
| | |
| In [1]: from fuel.datasets import IndexableDataset
| |
| ...: from collections import OrderedDict
| |
| | |
| In [2]: import numpy
| |
| ...: seed = 1234
| |
| ...: rng = numpy.random.RandomState(seed)
| |
| | |
| In [3]: features = rng.randint(256, size=(8, 2, 2))
| |
| ...: targets = rng.randint(4, size=(8, 1))
| |
| | |
| In [4]: dataset = IndexableDataset(
| |
| ...: indexables=OrderedDict([('features', features), ('targets', targets)]),
| |
| ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| |
| ...: ('targets', ('batch', 'index'))]))
| |
| | |
| In [5]: state = dataset.open()
| |
| | |
| In [6]: print("State is {}".format(state))
| |
| ...: print("NOTE: None state returned, because there is no state to maintain!")
| |
| | |
| State is None
| |
| NOTE: None state returned, because there is no state to maintain!
| |
| | |
| In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
| |
| (array([[[154, 30],
| |
| [171, 158]],
| |
| | |
| [[204, 116],
| |
| [152, 249]],
| |
| | |
| [[ 47, 211],
| |
| [ 38, 53]]]), array([[1],
| |
| [3],
| |
| [0]]))
| |
| | |
| In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
| |
| (array([[[204, 116],
| |
| [152, 249]],
| |
| | |
| [[143, 177],
| |
| [ 23, 233]],
| |
| | |
| [[236, 124],
| |
| [ 26, 118]],
| |
| | |
| [[246, 254],
| |
| [175, 50]]]), array([[3],
| |
| [0],
| |
| [2],
| |
| [3]]))
| |
| | |
| In [9]: dataset.close(state=state)
| |
| </pre>
| |
| | |
| No need to reset any iterator.
| |
| | |
| ==Iteration Schemes==
| |
| | |
| ===Iteration Scheme Examples===
| |
| | |
| Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.
| |
| | |
| ====Incorrect Usage====
| |
| | |
| Recall that above we created a dummy data set of random integers of shape (8, 2, 2) and created a Dataset from it:
| |
| | |
| <pre>
| |
| ~~~*~*~*~*~*~*~~~ flashback ~~~*~*~*~*~*~*~~~
| |
| | |
| In [1]: import numpy
| |
| | |
| In [2]: seed = 1234
| |
| | |
| In [3]: rng = numpy.random.RandomState(seed)
| |
| | |
| In [4]: features = rng.randint(256, size=(8, 2, 2))
| |
| | |
| In [5]: targets = rng.randint(4, size=(8, 1))
| |
| | |
| In [6]: from collections import OrderedDict
| |
| | |
| In [7]: from fuel.datasets import IterableDataset
| |
|
| |
|
| In [8]: dataset = IterableDataset(
| | Summary: |
| ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
| | * [[Fuel/Usage#Datasets|Datasets]] are the principal interface to data, but are abstract classes |
| ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
| | * [[Fuel/Usage#IterableDataset Example|IterableDatasets]] (less powerful) allow sequential access to data in specified order only |
| ...: ('targets', ('batch', 'index'))]))
| | * [[Fuel/Usage#IndexableDataset Example|IndexableDatasets]] (more powerful) allow random access to data |
| | | * [[Fuel/Usage#Iteration Schemes|Schemes]] allow iterating through IndexableDatasets in various orders (batch, sequential, shuffle, etc.) |
| ~~~*~*~*~*~*~*~~~ end flashback ~~~*~*~*~*~*~*~~~
| |
| </pre>
| |
| | |
| However, we created an IterableDataset, not an IndexableDataset.
| |
| | |
| This matters because an iteration scheme passes a request argument to get_data(), and an IterableDataset returns examples in a predefined order, so its get_data() does not accept a request.
| |
| | |
| If we ignore that fact and try to iterate over the IterableDataset in a custom order, we get a ValueError:
| |
| | |
| <pre>
| |
| In [23]: from fuel.schemes import ShuffledScheme
| |
| | |
| In [24]: state = dataset.open()
| |
| | |
| In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)
| |
| | |
| In [26]: for request in scheme.get_request_iterator():
| |
| ...: data = dataset.get_data(state=state, request=request)
| |
| ...: print(data[0].shape, data[1].shape)
| |
| ...:
| |
| ---------------------------------------------------------------------------
| |
| ValueError Traceback (most recent call last)
| |
| <ipython-input-27-24827dafdaa8> in <module>()
| |
| 1 for request in scheme.get_request_iterator():
| |
| ----> 2 data = dataset.get_data(state=state, request=request)
| |
| 3 print(data[0].shape, data[1].shape)
| |
| 4
| |
| | |
| /usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
| |
| 310 def get_data(self, state=None, request=None):
| |
| 311 if state is None or request is not None:
| |
| --> 312 raise ValueError
| |
| 313 return next(state)
| |
| 314
| |
| | |
| ValueError:
| |
| </pre>
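The check shown in the traceback can be reproduced with a small stand-in class. This is a minimal sketch, not Fuel's implementation: the class name TinyIterableDataset and its open() method are hypothetical, and only the body of get_data() mirrors the lines quoted in the traceback above.

```python
class TinyIterableDataset:
    """Hypothetical stand-in; only the get_data() check mirrors the traceback."""

    def __init__(self, *iterables):
        self.iterables = iterables

    def open(self):
        # The state lives outside the dataset: here, a fresh iterator over examples.
        return zip(*self.iterables)

    def get_data(self, state=None, request=None):
        # Sequential-only access: a state is required and any request is rejected.
        if state is None or request is not None:
            raise ValueError
        return next(state)


dataset = TinyIterableDataset([10, 20, 30], ['a', 'b', 'c'])
state = dataset.open()
first = dataset.get_data(state=state)  # sequential access works

try:
    dataset.get_data(state=state, request=[2, 0])  # custom order is refused
    rejected = False
except ValueError:
    rejected = True

print(first, rejected)  # prints: (10, 'a') True
```

Because the state is a plain iterator, there is no way to honor a request for specific indices, which is why the check raises instead of silently ignoring the request.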
| |
| | |
| ====Correct Usage====
| |
| | |
| We'll need to re-create our dataset, this time using an IndexableDataset object.
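What a shuffled batch scheme and an indexable dataset do together can be sketched in plain Python. This is a hedged illustration, not Fuel's code: the function names shuffled_batch_requests and get_data are hypothetical stand-ins for the roles played by ShuffledScheme and IndexableDataset.get_data(), and nested lists replace the NumPy arrays used above.

```python
import random


def shuffled_batch_requests(num_examples, batch_size, seed=1234):
    """Yield lists of indices that cover every example once, in shuffled order."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


def get_data(sources, request):
    """Return, for each source, the examples at the requested indices."""
    return tuple([source[i] for i in request] for source in sources)


features = [[[i, i], [i, i]] for i in range(8)]  # eight 2x2 toy "images"
targets = [[i % 4] for i in range(8)]            # one label per image

requests = list(shuffled_batch_requests(len(features), batch_size=4))
batches = [get_data((features, targets), request) for request in requests]

for batch_features, batch_targets in batches:
    print(len(batch_features), len(batch_targets))  # prints "4 4" twice
```

The key point is that the scheme only produces index requests; the dataset must support random access to serve them, which is what an IndexableDataset provides and an IterableDataset does not.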
| |
|
| |
|
| =Wrapping Custom Datasets with Fuel= | | =Wrapping Custom Datasets with Fuel= |
|
| |
|
A repository by GitHub user dribnet illustrates how to wrap a new dataset using Fuel: https://github.com/dribnet/lfw_fuel
| | {{Main|Fuel/Custom Datasets}} |
| | |
| Advantages:
| |
| * Only takes one command to download the data and import it into fuel
| |
* It then takes only one command to import the library that wraps the data and turn it into training/testing X and Y
| |
| | |
| Disadvantages:
| |
| * One-size-fits-all; importing data using load_data() can take a REALLY long time, and must be done every time you run the script (not persistent in memory)
| |
| * Complicated to extend
| |
| * Removes some of the nicer options of fuel
| |
| | |
| Here is what the final payoff looks like:
| |
| | |
| <pre>
| |
| from keras.models import Sequential
| |
| from lfw_fuel import lfw
| |
|
| |
|
| # the data, shuffled and split between train and test sets
| | Basically, the process of wrapping a custom data set with fuel looks like this: |
(X_train, Y_train), (X_test, Y_test) = lfw.load_data(format="deepfunneled")
| | * Specify how the original data should be downloaded, processed, and turned into a fuel data set |
| | * Specify how the fuel data set should be loaded |
|
| |
|
| # (build the perfect model here)
| | The first step - defining how to turn original data into fuel data: |
| | * Create a download wrapper - this tells fuel how to download the original data ("briq" download?) |
| | * Define a way to load a single piece of data (e.g., parameterized by name) and, optionally, paired/related pieces of data (e.g., two related images) |
| | * Write a convert function that extracts all the data and assembles it into an HDF5 file (and removes the original data when finished) |
|
| |
|
| model.fit(X_train, Y_train, show_accuracy=True, validation_data=(X_test, Y_test))
| | The second step - specifying how the fuel data set should be loaded: |
| score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
| | * Create a fuel Datasets object (inheriting from, e.g., H5PYDataset) |
| </pre>
| | * Define a way for that data to be loaded (example: make a universally-available load_data method in a package specific to your data set, as in lfw_fuel) |
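The two steps above can be sketched in plain Python. This is a rough stand-in under stated assumptions, not Fuel's API: it writes a pickle file where a real converter would write an HDF5 file, it skips the download step entirely, and the names convert, load_data, and raw_examples are hypothetical.

```python
import os
import pickle
import tempfile


def convert(raw_examples, output_path):
    """Step 1: gather the raw examples into one consolidated file.

    A real Fuel converter would write an HDF5 file; pickle is a stand-in here.
    """
    features = [example for example, _ in raw_examples]
    targets = [label for _, label in raw_examples]
    with open(output_path, 'wb') as handle:
        pickle.dump({'features': features, 'targets': targets}, handle)


def load_data(path, num_train):
    """Step 2: load the consolidated file and split it into train/test sets."""
    with open(path, 'rb') as handle:
        data = pickle.load(handle)
    X, y = data['features'], data['targets']
    return (X[:num_train], y[:num_train]), (X[num_train:], y[num_train:])


# Toy "original data": ten (features, label) pairs.
raw_examples = [([i, i + 1], i % 2) for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.pkl')

convert(raw_examples, path)
(X_train, y_train), (X_test, y_test) = load_data(path, num_train=8)
print(len(X_train), len(X_test))  # prints: 8 2
```

Keeping the conversion separate from loading means the expensive processing runs once, while load_data() only reads the consolidated file.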
|
| |
|
| =Flags= | | =Flags= |
|
| |
|
| | | {{FuelFlag}} |
| [[Category:Data Engineering]]
| |
| [[Category:NN]]
| |
| [[Category:ML]]
| |