Data Analysis Using Modern Python

Mika Pflüger

2018-06-01

whetting your appetite: smart indexing

In [1]:
%matplotlib inline
import pandas as pd

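# Series.from_csv was removed in pandas 1.0; read_csv + squeeze replaces it
measurement = pd.read_csv('measurement.csv', index_col=0, header=0, parse_dates=True).squeeze('columns')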
measurement.plot()
Out[1]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fde1d5f40b8>
In [2]:
reference = pd.read_csv('reference.csv', index_col=0, header=None, parse_dates=True).squeeze('columns')
reference.plot()
measurement.plot()
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fddf30da7b8>
In [3]:
norm_measurement = measurement / reference
norm_measurement.plot()
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fddf2fdb438>

whetting your appetite: Gaussian uncertainty propagation & units

In [6]:
from uncertainties import ufloat
import pint
u = pint.UnitRegistry()

l = ufloat(2, 0.2, tag='length') * u.m
t = ufloat(7.8, 0.002, tag='time') * u.s
v = l/t
v.to(u.km/u.hour)
v.magnitude.error_components()
Out[6]:
{< length = 2.0+/-0.2 >: 0.025641025641025647,
 < time = 7.8+/-0.002 >: 6.574621959237344e-05}
In [7]:
l - l
Out[7]:
0.0+/-0 meter
In [8]:
l + 2 *u.inch
Out[8]:
2.05+/-0.20 meter
In [9]:
l - 2 * u.s
---------------------------------------------------------------------------
DimensionalityError                       Traceback (most recent call last)
<ipython-input-9-3b1b9c0dd801> in <module>()
----> 1 l - 2 * u.s

/usr/lib/python3/dist-packages/pint/quantity.py in __sub__(self, other)
    596 
    597     def __sub__(self, other):
--> 598         return self._add_sub(other, operator.sub)
    599 
    600     def __rsub__(self, other):

/usr/lib/python3/dist-packages/pint/quantity.py in _add_sub(self, other, op)
    507             raise DimensionalityError(self._units, other._units,
    508                                       self.dimensionality,
--> 509                                       other.dimensionality)
    510 
    511         # Next we define some variables to make if-clauses more readable.

DimensionalityError: Cannot convert from 'meter' ([length]) to 'second' ([time])

what is python / why this talk?

  • scripting language: well suited to rapid prototyping
  • whitespace is significant: indentation defines code blocks
  • small core language plus a large ecosystem of packages

Basics: importing modules

In [ ]:
import module
from module import thing
In [10]:
# example
import numpy as np
print(np.pi)

from numpy import pi
print(pi)
3.141592653589793
3.141592653589793
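
The same import forms work for any module, and a name can also be renamed while importing it. A minimal sketch using the standard library module math (chosen here only for illustration; the alias fact is made up):

```python
import math                          # access names as math.<name>
from math import pi, sqrt            # bring selected names into the namespace
from math import factorial as fact   # rename a name on import (alias is arbitrary)

print(math.pi == pi)   # both names refer to the same object
print(sqrt(16))
print(fact(5))
```

Importing the whole module keeps the origin of each name visible at the call site, which tends to help in longer analysis scripts.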

Basics: data types

In [11]:
# strings
s = "hi\nthere"
print(s)
hi
there
In [12]:
# double or single quotes
s = 'hi\nthere'
print(s)
hi
there
In [13]:
# triple quotes for easy multilines
s = """hi
there"""
print(s)
hi
there
In [14]:
# integers, floats, complex numbers
i = 2
f = 3.6
c = 3.6 + 5.2j
i, f, c
Out[14]:
(2, 3.6, (3.6+5.2j))
In [16]:
# lists
l = ['a', 2.3, 5]
l[1]
Out[16]:
2.3
In [17]:
# dictionaries (maps)
d = {"a": 2.5,
     "b": 3.0,
     "c": 15}
d["d"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-750223f6d9a5> in <module>()
      3      "b": 3.0,
      4      "c": 15}
----> 5 d["d"]

KeyError: 'd'
In [18]:
d["d"] = 2
d["d"]
Out[18]:
2
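
To avoid the KeyError shown above, a key can be tested or looked up with a fallback before it is used. A small sketch with the same example dictionary (the fallback value 0.0 is an arbitrary choice):

```python
d = {"a": 2.5, "b": 3.0, "c": 15}

has_d = "d" in d             # membership test on the keys
value = d.get("d")           # returns None instead of raising KeyError
fallback = d.get("d", 0.0)   # returns the given default if the key is missing

# iterate over key/value pairs
for key, val in d.items():
    print(key, val)
```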
In [30]:
# pandas DataFrame
import pandas as pd
df = pd.read_csv('./example.dat', header=201, sep='\t', index_col=0)
df
Out[30]:
2Theta Keysight1 Ring_1 PosCountTimer
# Theta
0.00 0.00 7.751940e-08 247.803183 6494
0.01 0.02 7.845600e-08 247.792078 7729
0.02 0.04 7.922830e-08 247.781630 8882
0.03 0.06 8.029980e-08 247.771067 10034
0.04 0.08 8.161357e-08 247.758052 11374
0.05 0.10 8.323400e-08 247.747014 12579
0.06 0.12 8.541940e-08 247.733945 13866
0.07 0.14 8.721787e-08 247.724389 15071
0.08 0.16 8.922213e-08 247.711872 16411
0.09 0.18 9.130493e-08 247.701092 17578
0.10 0.20 9.337193e-08 247.689972 18867
0.11 0.22 9.528150e-08 247.679096 20019
0.12 0.24 9.691873e-08 247.665657 21411
0.13 0.26 9.851240e-08 247.655321 22580
0.14 0.28 9.995727e-08 247.643026 23869
0.15 0.30 1.013926e-07 247.632124 25021
0.16 0.32 1.027649e-07 247.620359 26413
0.17 0.34 1.040000e-07 247.609287 27616
0.18 0.36 1.043978e-07 247.597854 28894
0.19 0.38 1.049485e-07 247.585743 30091
0.20 0.40 1.059835e-07 247.573420 31415
0.21 0.42 1.062361e-07 247.562737 32618
0.22 0.44 1.068761e-07 247.550808 33923
0.23 0.46 1.075552e-07 247.540196 35092
0.24 0.48 1.090245e-07 247.528460 36415
0.25 0.50 1.102908e-07 247.517202 37619
0.26 0.52 1.111454e-07 247.505716 38925
0.27 0.54 1.120614e-07 247.493667 40128
0.28 0.56 1.088269e-07 247.483039 41383
0.29 0.58 1.074410e-07 247.473293 42534
... ... ... ... ...
9.72 19.44 4.035257e-13 247.828837 1260176
9.73 19.46 4.085123e-13 247.818135 1261327
9.74 19.48 4.124470e-13 247.807334 1262549
9.75 19.50 4.096387e-13 247.805583 1263699
9.76 19.52 4.072923e-13 247.785821 1264850
9.77 19.54 3.890220e-13 247.773549 1266191
9.78 19.56 3.850197e-13 247.763571 1267343
9.79 19.58 3.875630e-13 247.761068 1268684
9.80 19.60 3.714487e-13 247.749753 1269921
9.81 19.62 3.572130e-13 247.728133 1271193
9.82 19.64 3.735373e-13 247.717709 1272344
9.83 19.66 3.790957e-13 247.714039 1273737
9.84 19.68 4.007653e-13 247.703636 1274939
9.85 19.70 4.174377e-13 247.691288 1276195
9.86 19.72 3.989240e-13 247.681306 1277346
9.87 19.74 3.883707e-13 247.660047 1278688
9.88 19.76 3.672517e-13 247.657634 1279890
9.89 19.78 3.564300e-13 247.644066 1281231
9.90 19.80 3.633200e-13 247.635210 1282399
9.91 19.82 3.899460e-13 247.623047 1283705
9.92 19.84 3.793430e-13 247.612166 1284856
9.93 19.86 3.635057e-13 247.590814 1286197
9.94 19.88 3.584580e-13 247.589866 1287349
9.95 19.90 3.358127e-13 247.578697 1288500
9.96 19.92 3.306247e-13 247.568053 1289669
9.97 19.94 3.326100e-13 247.557524 1290872
9.98 19.96 3.306003e-13 247.528991 1292058
9.99 19.98 3.336347e-13 247.519405 1293209
10.00 20.00 3.305060e-13 247.525609 1294446
10.00 20.00 NaN NaN 1295531

1002 rows × 4 columns

In [20]:
df.loc[0.5:1.2]
Out[20]:
2Theta Keysight1 Ring_1 PosCountTimer
# Theta
0.50 1.00 9.312017e-08 247.234959 68539
0.51 1.02 9.292613e-08 247.222391 69878
0.52 1.04 9.288873e-08 247.210314 71143
0.53 1.06 9.232707e-08 248.215130 73935
0.54 1.08 9.156503e-08 248.194863 75138
0.55 1.10 9.054213e-08 248.183726 76388
0.56 1.12 8.932890e-08 248.173093 77544
0.57 1.14 8.772530e-08 248.170120 78884
0.58 1.16 8.560990e-08 248.159588 80035
0.59 1.18 8.291830e-08 248.137058 81428
0.60 1.20 7.942613e-08 248.127586 82598
0.61 1.22 7.482373e-08 248.124554 83938
0.62 1.24 6.865640e-08 248.112362 85141
0.63 1.26 6.090807e-08 248.101956 86395
0.64 1.28 5.198290e-08 248.081511 87547
0.65 1.30 4.356907e-08 248.078455 88887
0.66 1.32 3.685230e-08 248.067541 90091
0.67 1.34 3.178583e-08 248.054956 91431
0.68 1.36 2.778277e-08 248.044762 92634
0.69 1.38 2.455030e-08 248.032823 93940
0.70 1.40 2.206610e-08 248.012145 95143
0.71 1.42 1.986737e-08 248.000759 96398
0.72 1.44 1.782372e-08 247.999821 97579
0.73 1.46 1.621354e-08 247.986872 98890
0.74 1.48 1.492485e-08 247.965814 100094
0.75 1.50 1.369708e-08 247.953848 101434
0.76 1.52 1.246440e-08 247.944086 102603
0.77 1.54 1.134819e-08 247.940428 103908
0.78 1.56 1.048855e-08 247.920800 105060
0.79 1.58 9.761897e-09 247.910522 106263
... ... ... ... ...
0.91 1.82 4.015940e-09 247.784588 121112
0.92 1.84 3.756217e-09 247.762340 122418
0.93 1.86 3.473910e-09 247.761459 123570
0.94 1.88 3.214327e-09 247.738015 124946
0.95 1.90 3.017150e-09 247.728381 126114
0.96 1.92 2.880703e-09 247.715590 127438
0.97 1.94 2.763130e-09 247.706739 128641
0.98 1.96 2.628137e-09 247.703943 129947
0.99 1.98 2.453100e-09 247.683214 131151
1.00 2.00 2.270293e-09 247.671169 132404
1.01 2.02 2.112197e-09 247.661022 133556
1.02 2.04 2.003927e-09 247.658265 134897
1.03 2.06 1.931747e-09 247.646875 136100
1.04 2.08 1.866501e-09 247.626023 137406
1.05 2.10 1.774149e-09 247.614537 138575
1.06 2.12 1.653561e-09 247.613446 139747
1.07 2.14 1.527938e-09 247.602115 140982
1.08 2.16 1.426483e-09 247.591290 142235
1.09 2.18 1.361204e-09 247.570827 143387
1.10 2.20 1.322285e-09 247.559625 144556
1.11 2.22 1.283407e-09 247.549100 145706
1.12 2.24 1.222888e-09 247.546926 146910
1.13 2.26 1.139258e-09 247.536231 148062
1.14 2.28 1.052174e-09 247.524500 149403
1.15 2.30 9.797617e-10 247.504664 150606
1.16 2.32 9.359380e-10 247.492633 151894
1.17 2.34 9.131220e-10 247.491293 153098
1.18 2.36 8.918657e-10 247.479770 154403
1.19 2.38 8.571600e-10 247.467412 155640
1.20 2.40 8.007293e-10 247.455500 156947

71 rows × 4 columns

In [31]:
df.iloc[1:3]
Out[31]:
2Theta Keysight1 Ring_1 PosCountTimer
# Theta
0.01 0.02 7.845600e-08 247.792078 7729
0.02 0.04 7.922830e-08 247.781630 8882
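
The difference between the two indexers is easier to see on a small table. The sketch below uses a made-up DataFrame (column name counts and index name theta are assumptions for illustration): loc selects by index label and includes both end points, iloc selects by position and excludes the stop position.

```python
import pandas as pd

# Hypothetical miniature version of the scan data
df = pd.DataFrame(
    {"counts": [5, 7, 9, 11, 13]},
    index=pd.Index([0.0, 0.1, 0.2, 0.3, 0.4], name="theta"),
)

by_label = df.loc[0.1:0.3]    # labels 0.1, 0.2 and 0.3 (end point included)
by_position = df.iloc[1:3]    # positions 1 and 2 (stop position excluded)

print(by_label)
print(by_position)
```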
In [32]:
df.plot(y="Keysight1", logy=True)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fddf2735eb8>
In [38]:
# all rows except the last three; a row slice is still a DataFrame, not a Series
ser = df.iloc[:-3]
ser
Out[38]:
2Theta Keysight1 Ring_1 PosCountTimer
# Theta
0.00 0.00 7.751940e-08 247.803183 6494
0.01 0.02 7.845600e-08 247.792078 7729
0.02 0.04 7.922830e-08 247.781630 8882
0.03 0.06 8.029980e-08 247.771067 10034
0.04 0.08 8.161357e-08 247.758052 11374
0.05 0.10 8.323400e-08 247.747014 12579
0.06 0.12 8.541940e-08 247.733945 13866
0.07 0.14 8.721787e-08 247.724389 15071
0.08 0.16 8.922213e-08 247.711872 16411
0.09 0.18 9.130493e-08 247.701092 17578
0.10 0.20 9.337193e-08 247.689972 18867
0.11 0.22 9.528150e-08 247.679096 20019
0.12 0.24 9.691873e-08 247.665657 21411
0.13 0.26 9.851240e-08 247.655321 22580
0.14 0.28 9.995727e-08 247.643026 23869
0.15 0.30 1.013926e-07 247.632124 25021
0.16 0.32 1.027649e-07 247.620359 26413
0.17 0.34 1.040000e-07 247.609287 27616
0.18 0.36 1.043978e-07 247.597854 28894
0.19 0.38 1.049485e-07 247.585743 30091
0.20 0.40 1.059835e-07 247.573420 31415
0.21 0.42 1.062361e-07 247.562737 32618
0.22 0.44 1.068761e-07 247.550808 33923
0.23 0.46 1.075552e-07 247.540196 35092
0.24 0.48 1.090245e-07 247.528460 36415
0.25 0.50 1.102908e-07 247.517202 37619
0.26 0.52 1.111454e-07 247.505716 38925
0.27 0.54 1.120614e-07 247.493667 40128
0.28 0.56 1.088269e-07 247.483039 41383
0.29 0.58 1.074410e-07 247.473293 42534
... ... ... ... ...
9.69 19.38 3.899333e-13 247.863983 1256395
9.70 19.40 3.995507e-13 247.850368 1257684
9.71 19.42 4.006050e-13 247.849418 1258887
9.72 19.44 4.035257e-13 247.828837 1260176
9.73 19.46 4.085123e-13 247.818135 1261327
9.74 19.48 4.124470e-13 247.807334 1262549
9.75 19.50 4.096387e-13 247.805583 1263699
9.76 19.52 4.072923e-13 247.785821 1264850
9.77 19.54 3.890220e-13 247.773549 1266191
9.78 19.56 3.850197e-13 247.763571 1267343
9.79 19.58 3.875630e-13 247.761068 1268684
9.80 19.60 3.714487e-13 247.749753 1269921
9.81 19.62 3.572130e-13 247.728133 1271193
9.82 19.64 3.735373e-13 247.717709 1272344
9.83 19.66 3.790957e-13 247.714039 1273737
9.84 19.68 4.007653e-13 247.703636 1274939
9.85 19.70 4.174377e-13 247.691288 1276195
9.86 19.72 3.989240e-13 247.681306 1277346
9.87 19.74 3.883707e-13 247.660047 1278688
9.88 19.76 3.672517e-13 247.657634 1279890
9.89 19.78 3.564300e-13 247.644066 1281231
9.90 19.80 3.633200e-13 247.635210 1282399
9.91 19.82 3.899460e-13 247.623047 1283705
9.92 19.84 3.793430e-13 247.612166 1284856
9.93 19.86 3.635057e-13 247.590814 1286197
9.94 19.88 3.584580e-13 247.589866 1287349
9.95 19.90 3.358127e-13 247.578697 1288500
9.96 19.92 3.306247e-13 247.568053 1289669
9.97 19.94 3.326100e-13 247.557524 1290872
9.98 19.96 3.306003e-13 247.528991 1292058

999 rows × 4 columns

In [39]:
pd.Series([1.2, 2.2, 3.6], index=[3300, 3301, 3302])
Out[39]:
3300    1.2
3301    2.2
3302    3.6
dtype: float64

understanding "whetting your appetite: smart indexing"

In [ ]:
import pandas as pd

measurement = pd.read_csv('measurement.csv', index_col=0, header=0, parse_dates=True).squeeze('columns')
reference = pd.read_csv('reference.csv', index_col=0, header=None, parse_dates=True).squeeze('columns')
norm_measurement = measurement / reference

measurement.plot()
norm_measurement.plot()
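
What makes the division "smart" is that pandas pairs values by index label, not by position. A minimal sketch with made-up numbers standing in for the two CSV files (the values and index labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: the two series only partly share their index
measurement = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])
reference = pd.Series([2.0, 4.0, 5.0], index=[2, 3, 4])

# Aligned by label: only labels 2 and 3 exist in both, the rest become NaN
norm = measurement / reference
print(norm)

# For contrast: plain arrays are divided position by position
positional = measurement.to_numpy() / reference.to_numpy()
print(positional)
```

The aligned result has the union of both indexes, so points measured at different positions are never divided by each other silently.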

digression: "dumb" arrays

In [42]:
l = [2, 4, 6, 8]
l/2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-ae542751842f> in <module>()
      1 l = [2, 4, 6, 8]
----> 2 l/2

TypeError: unsupported operand type(s) for /: 'list' and 'int'
In [ ]:
import numpy as np
a = np.array(l)
a/2
In [46]:
3%2
Out[46]:
1

  • list vs. np.array: arrays support element-wise math; all elements share one dtype
  • np.array vs. pd.Series/pd.DataFrame: pandas adds smart (label-based) indexing, plotting and other conveniences

=> if in doubt, use pd.Series/pd.DataFrame
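
A short sketch of the first point, using the list from the cell above (the variable names halves_list, halves and mixed are made up for this example):

```python
import numpy as np

l = [2, 4, 6, 8]

# With a plain list, the division has to be written out per element
halves_list = [x / 2 for x in l]

# With an array, arithmetic applies to every element at once
a = np.array(l)
halves = a / 2

# All elements of an array share one dtype: mixing int and float gives float
mixed = np.array([1, 2.5])

print(halves_list, halves, a.dtype, mixed.dtype)
```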

More core python: functions

In [47]:
def sq_half(x):
    return x**2 / 2

sq_half(4)
Out[47]:
8.0
In [58]:
def rms(x, y=2):
    """Return the root mean square of x and y, followed by x and y themselves."""
    return np.sqrt((x**2 + y**2) / 2), x, y

a, b, c = rms(3)
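
The cell above uses two features at once: a default argument (y=2) and a tuple return value that is unpacked into several names. A self-contained sketch of the same idea, with math.sqrt swapped in for np.sqrt so that no extra package is needed:

```python
import math

def rms(x, y=2):
    """Return the root mean square of x and y, followed by x and y themselves."""
    return math.sqrt((x**2 + y**2) / 2), x, y

# y falls back to its default value 2
value, x_used, y_used = rms(3)

# the default can be overridden by keyword; unused results go to _
value_kw, _, _ = rms(3, y=4)

print(value, x_used, y_used, value_kw)
```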

More core python: classes + objects

In [59]:
import numpy as np
class Point:
    """A 2d point"""
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def distance(self, other):
        """Euclidean distance to the other point."""
        return np.sqrt((self.x - other.x)**2 + (self.y - other.y)**2)
    
    
p = Point(2, 3)
q = Point(3, 7)
p.distance(q)
Out[59]:
4.1231056256176606
  • use classes to group related data and functionality
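
For classes that mainly hold data, the standard library offers dataclasses (Python 3.7 and later), which generate __init__, a readable repr and equality comparison. This is an alternative sketch of the Point class, not the version used in the talk; math.hypot replaces np.sqrt here:

```python
from dataclasses import dataclass
import math

@dataclass
class Point:
    """A 2d point."""
    x: float
    y: float

    def distance(self, other):
        """Euclidean distance to the other point."""
        return math.hypot(self.x - other.x, self.y - other.y)

p = Point(2, 3)
q = Point(3, 7)
print(p)               # the generated repr shows the field values
print(p.distance(q))
print(p == Point(2, 3))
```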

understanding "whetting your appetite: uncertainties + units"

In [60]:
from uncertainties import ufloat
import pint
u = pint.UnitRegistry()

l = ufloat(2, 0.2, tag='length') * u.m
t = ufloat(7.8, 0.002, tag='time') * u.s
v = l/t
type(v)
Out[60]:
pint.unit.build_quantity_class.<locals>.Quantity
In [61]:
type(v.magnitude)
Out[61]:
uncertainties.AffineScalarFunc
In [63]:
v.magnitude.error_components()
Out[63]:
{< length = 2.0+/-0.2 >: 0.025641025641025647,
 < time = 7.8+/-0.002 >: 6.574621959237344e-05}
  • ufloat objects group a value with its uncertainty; pint Quantity objects group a magnitude with its unit
  • composing the two gives an easy-to-use calculator
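
The two numbers returned by error_components() can be checked by hand. The sketch below assumes first-order (linear) Gaussian propagation for v = l/t, where each component is the partial derivative of v times the input uncertainty, and uses only the standard library; the variable names are made up for this example.

```python
import math

l, sigma_l = 2.0, 0.2      # length in m
t, sigma_t = 7.8, 0.002    # time in s

v = l / t

# contribution of each input: |partial derivative| * uncertainty
comp_l = abs(1 / t) * sigma_l        # dv/dl = 1/t
comp_t = abs(-l / t**2) * sigma_t    # dv/dt = -l/t**2

# independent contributions add in quadrature
sigma_v = math.hypot(comp_l, comp_t)

print(comp_l, comp_t, sigma_v)
```

The length term dominates by more than two orders of magnitude, which is the kind of information the tagged error components make easy to read off.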

Last important bit of syntax: loops and conditionals

In [ ]:
for x in iterable:
    do_stuff
In [65]:
l = ["a", "b", 15]
for x in l:
    print(x)
a
b
15
In [ ]:
# to count (like loop in C/C++)
for i in range(5):
    print(i)
    
In [69]:
range(100000)
Out[69]:
range(0, 100000)
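
As the output above suggests, range does not build a list; it produces its numbers on demand. A short sketch of turning it into a list, and of enumerate for the common case where both a counter and the item are needed:

```python
r = range(5)
numbers = list(r)            # materialize the numbers 0..4

items = ["a", "b", 15]
for i, item in enumerate(items):   # counter and item together
    print(i, item)

# a list comprehension is a compact loop that builds a list
squares = [i**2 for i in range(5)]
print(numbers, squares)
```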
In [ ]:
# conditionals
In [ ]:
if thing:
    do_stuff
In [75]:
a = 2
if a < 0:
    a = a * -1
elif a < 2:
    print('haha')  # not reached for a = 2
print(a)
2
In [73]:
# the a < 0 branch above is equivalent to
a = abs(a)

Getting help

In [ ]:
thing?
In [79]:
rms??
  • google: python + thing
  • prefer readthedocs.org and docs.python.org
  • show pint docs
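
The ? and ?? shortcuts only exist in ipython and jupyter. In a plain Python script the same information is available programmatically; a minimal sketch (the rms function is redefined here without numpy so the example is self-contained):

```python
import inspect

def rms(x, y=2):
    """root mean square of x and y"""
    return ((x**2 + y**2) / 2) ** 0.5

doc = rms.__doc__                    # the docstring, as shown by rms?
sig = str(inspect.signature(rms))    # the call signature

print(sig)
print(doc)
# help(rms) prints both together in any Python interpreter
```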

Environment

  • ipython: interactive terminal
  • jupyter: interactive notebooks in the browser (and slide shows)
  • pycharm: IDE with debugger, syntax checks etc.
  • conda: binary package manager for python: stable, easy installation

Tour of the ecosystem

core stdlib

usually the most useful:

  • os.path
  • logging
  • subprocess

worth a look whenever you are searching for something
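
As an illustration of os.path, the sketch below builds and takes apart a file path without hard-coding the path separator. The directory names data and run01 are made up for this example.

```python
import os.path

# join path components with the separator of the current platform
path = os.path.join("data", "run01", "measurement.csv")

folder, filename = os.path.split(path)    # directory part and file name
stem, ext = os.path.splitext(filename)    # name without and with extension

print(folder, filename, stem, ext)
print(os.path.exists(path))               # check before trying to read
```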

In [ ]:
import subprocess
subprocess.run(['ls', '-la'])
In [81]:
!ls -la
total 616
drwxr-xr-x  3 mikapfl mikapfl   4096 Jun  1 15:03 .
drwxr-xr-x 30 mikapfl mikapfl   4096 Jun  1 11:15 ..
-rw-r--r--  1 mikapfl mikapfl 144020 Jun  1 15:03 data-analysis-python.ipynb
-rw-r--r--  1 mikapfl mikapfl  87376 Jun  1 13:14 example.dat
-rw-r--r--  1 mikapfl mikapfl  87418 Jun  1 13:03 example.dat~
drwxr-xr-x  2 mikapfl mikapfl   4096 Jun  1 11:15 .ipynb_checkpoints
-rw-r--r--  1 mikapfl mikapfl     53 Jun  1 11:40 matplotlibrc
-rw-r--r--  1 mikapfl mikapfl     23 Jun  1 11:33 .matplotlibrc~
-rw-r--r--  1 mikapfl mikapfl     24 Jun  1 11:34 matplotlibrc~
-rw-r--r--  1 mikapfl mikapfl  40455 Jun  1 11:24 measurement.csv
-rw-r--r--  1 mikapfl mikapfl  40438 Jun  1 11:12 measurement.csv~
-rw-r--r--  1 mikapfl mikapfl 193613 Jun  1 11:24 reference.csv
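
In a script, the output of an external program is usually wanted as a string, not just printed. A sketch under the assumption of Python 3.7 or later (for capture_output); it calls the Python interpreter itself as the child process so that it runs on any platform, where the talk used ls:

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,   # collect stdout and stderr instead of printing them
    text=True,             # decode the output to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)

print(result.returncode)
print(result.stdout.strip())
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.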

numpy

In [ ]:
import numpy as np

scipy

In [ ]:
import scipy as sp

pandas

In [ ]:
import pandas as pd

PTB-interesting

plotting

faster computation

more

suggested literature