Pandas
From charlesreid1
Revision as of 17:56, 23 July 2013
Installing
For some reason, my installation process was a real pain to get working. In the end, the trouble came down to phantom problems that kept shifting and acting really weird, and that probably had to do with a crusty Bash $PYTHONPATH variable or something.
My easy_install and pip were both throwing errors when trying to install pandas, due to missing prerequisites. One complained about a too-old version of numpy (not the most recent version, which I had built myself, so I suspect the wrong Python installation or module path was being picked up). I was able to use Virtualenv to debug some of the problems, but I basically ended up clearing out my $PYTHONPATH and moving my custom-installed numpy off of it, which broke my versions of scipy, ipython, and matplotlib as well. So I had to use pip to re-install pip's versions of all of these. Installing these ended up not working, however, and when I put my old, by-hand installations back on my $PYTHONPATH, installing pandas, pytables, and numexpr with pip all went smoothly.
So, in the end, I don't actually know what the right procedure is; I just have a vague sense that there were some problems that got resolved by something I did at some point.
First, I downloaded and installed easy_install from source.
Then, I blasted my PYTHONPATH:
$ unset PYTHONPATH
Then, I ran the following commands:
$ sudo easy_install pip
$ sudo pip install numpy
$ sudo pip install numexpr
$ sudo pip install tables
$ sudo pip install pandas
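Since most of the trouble above seemed to come from Python picking up the wrong copy of a package, one quick sanity check after installing is to ask Python which copy of each module it actually imports. The following is a minimal sketch and was not part of the original procedure; the helper name report_modules is made up for illustration.

```python
import importlib
import os
import sys


def report_modules(names):
    """Return {name: (version, file)} for importable modules, else None."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Not installed, not visible on sys.path, or a broken build.
            report[name] = None
            continue
        version = getattr(module, "__version__", "unknown")
        location = getattr(module, "__file__", "built-in")
        report[name] = (version, location)
    return report


if __name__ == "__main__":
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", "(unset)"))
    print("python     =", sys.executable)
    packages = ["numpy", "numexpr", "tables", "pandas"]
    for name, info in report_modules(packages).items():
        print(name, "->", info if info else "NOT IMPORTABLE")
```

If the reported file for numpy sits in a directory you did not expect (for example, an old by-hand build), that is a hint that $PYTHONPATH is pulling in the wrong copy.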
Data
Creating a Table of Arbitrary Data Types
Let's say you're trying to create a data table where you store the result of a simulation. This simulation has a set of inputs and outputs, each with a different data type. For example, the following inputs are scalars:
- Flowrate_in (float)
- Temperature_in (float)
- Pressure_in (float)
But temperature and species profiles are vectors, not scalars:
- Temperature_profile (numpy array)
- Oxygen_profile (numpy array)
Two ways of populating a Pandas data object (a DataFrame, in this case) are:
- Create arbitrary, concrete data with the type you are interested in storing
- Grab the types of the data you are interested in storing
Initializing with Data
A simple illustration of the first technique (assuming numpy and pandas have been imported):
In[99]: reactors = [ { "flowrate_in" : 0.0, "temperature_in" : 0.0, "pressure_in" : 0.0, "temperature_profile" : numpy.zeros(5), "oxygen_profile" : numpy.zeros(5) } for i in range(10) ]
This creates a list of 10 dicts containing the same initial values, which can then be used to initialize a DataFrame object:
In[100]: pandas.DataFrame(reactors)
Out[100]:
flowrate_in oxygen_profile pressure_in temperature_in \
0 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
1 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
2 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
3 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
4 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
5 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
6 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
7 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
8 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
9 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
temperature_profile
0 [0.0, 0.0, 0.0, 0.0, 0.0]
1 [0.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0]
3 [0.0, 0.0, 0.0, 0.0, 0.0]
4 [0.0, 0.0, 0.0, 0.0, 0.0]
5 [0.0, 0.0, 0.0, 0.0, 0.0]
6 [0.0, 0.0, 0.0, 0.0, 0.0]
7 [0.0, 0.0, 0.0, 0.0, 0.0]
8 [0.0, 0.0, 0.0, 0.0, 0.0]
9 [0.0, 0.0, 0.0, 0.0, 0.0]
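The same idea as a self-contained script, as a rough sketch: the names n_reactors and n_points are introduced here for illustration, and the profile length of 5 is an arbitrary choice that matches the output shown above.

```python
import numpy as np
import pandas as pd

n_reactors = 10
n_points = 5  # length of each profile; an arbitrary choice for this sketch

# One dict per reactor; each dict gets its own freshly allocated arrays.
reactors = [
    {
        "flowrate_in": 0.0,
        "temperature_in": 0.0,
        "pressure_in": 0.0,
        "temperature_profile": np.zeros(n_points),
        "oxygen_profile": np.zeros(n_points),
    }
    for _ in range(n_reactors)
]

df = pd.DataFrame(reactors)

# Scalar columns get a numeric dtype; array-valued columns are stored as
# generic Python objects (dtype "object").
print(df.dtypes)
print(df.shape)
```

Because the arrays are created inside the list comprehension, each row holds its own array rather than ten references to one shared array.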
Initializing with Types
A simple illustration of the second technique:
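In[101]: reactors = [ { "flowrate_in" : numpy.float32, "temperature_in" : numpy.float32, "pressure_in" : numpy.float32, "temperature_profile" : numpy.ndarray, "oxygen_profile" : numpy.ndarray } for i in range(10) ]
This creates a list of 10 dicts whose values are type objects rather than data. The resulting DataFrame stores those type objects as placeholders in each cell; it does not enforce the types on whatever is assigned later: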
In[102]: df = pandas.DataFrame(reactors)
Out[102]:
flowrate_in oxygen_profile pressure_in \
0 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
1 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
2 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
3 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
4 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
5 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
6 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
7 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
8 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
9 <type 'numpy.float32'> <type 'numpy.ndarray'> <type 'numpy.float32'>
temperature_in temperature_profile
0 <type 'numpy.float32'> <type 'numpy.ndarray'>
1 <type 'numpy.float32'> <type 'numpy.ndarray'>
2 <type 'numpy.float32'> <type 'numpy.ndarray'>
3 <type 'numpy.float32'> <type 'numpy.ndarray'>
4 <type 'numpy.float32'> <type 'numpy.ndarray'>
5 <type 'numpy.float32'> <type 'numpy.ndarray'>
6 <type 'numpy.float32'> <type 'numpy.ndarray'>
7 <type 'numpy.float32'> <type 'numpy.ndarray'>
8 <type 'numpy.float32'> <type 'numpy.ndarray'>
9 <type 'numpy.float32'> <type 'numpy.ndarray'>
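To see what the placeholder approach actually stores, it helps to look at the column dtypes. The sketch below also shows one alternative that is not from the original notes: declaring a dtype per column with pandas.Series, so the scalar columns really are float32. The variable names placeholder and typed are made up for illustration.

```python
import numpy as np
import pandas as pd

n_rows = 10

# Type objects used as cell values are stored as plain Python objects.
placeholder = pd.DataFrame(
    [
        {"flowrate_in": np.float32, "temperature_profile": np.ndarray}
        for _ in range(n_rows)
    ]
)
print(placeholder.dtypes)  # both columns report dtype "object"

# Alternative (an assumption, not the original method): declare the dtype
# of each column up front and fill it with missing values.
index = pd.RangeIndex(n_rows)
typed = pd.DataFrame(
    {
        "flowrate_in": pd.Series(np.nan, index=index, dtype="float32"),
        "temperature_in": pd.Series(np.nan, index=index, dtype="float32"),
        "pressure_in": pd.Series(np.nan, index=index, dtype="float32"),
        "temperature_profile": pd.Series([None] * n_rows, index=index, dtype=object),
        "oxygen_profile": pd.Series([None] * n_rows, index=index, dtype=object),
    }
)
print(typed.dtypes)
```

The profile columns stay dtype object either way, since a cell holding a whole array has no dedicated pandas dtype.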
Modifying a Table with Data
When you treat data as a 2D array of arbitrary data types, each of those numpy.ndarray objects can be whatever size it wants. Pandas stores each array as a generic Python object (the column has dtype object), so it doesn't check the shape or size of the array.
This means that, in practice, you could have temperature or oxygen profiles of entirely different sizes. Start with df, a DataFrame built from the zero-filled data shown under Initializing with Data:
In [117]: df
Out[117]:
flowrate_in oxygen_profile pressure_in temperature_in \
0 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
1 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
2 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
3 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
4 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
5 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
6 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
7 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
8 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
9 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
temperature_profile
0 [0.0, 0.0, 0.0, 0.0, 0.0]
1 [0.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0]
3 [0.0, 0.0, 0.0, 0.0, 0.0]
4 [0.0, 0.0, 0.0, 0.0, 0.0]
5 [0.0, 0.0, 0.0, 0.0, 0.0]
6 [0.0, 0.0, 0.0, 0.0, 0.0]
7 [0.0, 0.0, 0.0, 0.0, 0.0]
8 [0.0, 0.0, 0.0, 0.0, 0.0]
9 [0.0, 0.0, 0.0, 0.0, 0.0]
Now set the temperature profiles to be profiles of different lengths (using df.at to set one cell at a time, which avoids assigning through chained indexing):
In [122]: df.at[0, 'temperature_profile'] = 25*numpy.ones(3)
In [123]: df.at[1, 'temperature_profile'] = 28*numpy.ones(5)
In [124]: df.at[2, 'temperature_profile'] = 30*numpy.ones(8)
In [125]: df
Out[125]:
flowrate_in oxygen_profile pressure_in temperature_in \
0 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
1 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
2 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
3 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
4 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
5 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
6 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
7 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
8 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
9 0 [0.0, 0.0, 0.0, 0.0, 0.0] 0 0
temperature_profile
0 [25.0, 25.0, 25.0]
1 [28.0, 28.0, 28.0, 28.0, 28.0]
2 [30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0]
3 [0.0, 0.0, 0.0, 0.0, 0.0]
4 [0.0, 0.0, 0.0, 0.0, 0.0]
5 [0.0, 0.0, 0.0, 0.0, 0.0]
6 [0.0, 0.0, 0.0, 0.0, 0.0]
7 [0.0, 0.0, 0.0, 0.0, 0.0]
8 [0.0, 0.0, 0.0, 0.0, 0.0]
9 [0.0, 0.0, 0.0, 0.0, 0.0]
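Once the profiles have different lengths, whole-column arithmetic on them no longer makes much sense, but per-row operations still work. As a rough sketch (the two-column table and the names lengths and means are chosen here for illustration), Series.map applies a function to each stored array:

```python
import numpy as np
import pandas as pd

# A smaller stand-in for the table above: one scalar and one array column.
df = pd.DataFrame(
    [{"temperature_in": 0.0, "temperature_profile": np.zeros(5)} for _ in range(10)]
)

# .at sets a single cell; the column has dtype object, so any array fits.
df.at[0, "temperature_profile"] = 25 * np.ones(3)
df.at[1, "temperature_profile"] = 28 * np.ones(5)
df.at[2, "temperature_profile"] = 30 * np.ones(8)

# Apply a function to each cell to get one number per row.
lengths = df["temperature_profile"].map(len)
means = df["temperature_profile"].map(np.mean)

print(lengths.tolist())
print(means.tolist())
```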
Saving a Table with Data
H5
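To save a DataFrame using HDF5 (this needs PyTables, the tables package installed above), passing the key by keyword, which newer pandas releases require:
df.to_hdf('dummy.h5', key='name_of_array', append=False)
df_h5 = pandas.read_hdf('dummy.h5', key='name_of_array')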
CSV
df.to_csv('dummy.csv')
df_csv = pandas.read_csv('dummy.csv')
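One caveat with CSV: a cell that holds a numpy array is written out as its text representation, so it comes back as a string rather than an array. The sketch below shows the round trip on a tiny table; the helper parse_profile is hypothetical and only handles short 1D float arrays (numpy abbreviates very long arrays with "..." when converting them to text, which this parser does not handle).

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [{"flowrate_in": 0.0, "temperature_profile": 25 * np.ones(3)} for _ in range(2)]
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "dummy.csv")
    df.to_csv(path)
    # index_col=0 reads the saved row labels back as the index.
    df_csv = pd.read_csv(path, index_col=0)

# The array comes back as text, not as a numpy array.
cell = df_csv.loc[0, "temperature_profile"]
print(type(cell), repr(cell))


def parse_profile(text):
    """Turn text like '[25. 25. 25.]' back into a 1D float array."""
    return np.array(text.strip("[]").split(), dtype=float)


df_csv["temperature_profile"] = df_csv["temperature_profile"].map(parse_profile)
print(type(df_csv.loc[0, "temperature_profile"]))
```

For tables with array-valued cells, a binary format is usually a better fit than CSV, since it avoids this text conversion.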