Tricks
======

This section of the book discusses a couple of tricks that can be
exploited to either speed up computations, or to save on RAM. However,
there is probably no silver bullet, and you have to evaluate your code
in terms of execution speed (if the execution is time critical), or in
terms of RAM used. You should also keep in mind that, if a particular
code snippet is optimised on some hardware, there is no guarantee that
you will get similar improvements on another piece of hardware. Hardware
implementations are vastly different. Some microcontrollers do not even
have an FPU, so you should not be surprised to get significantly
different benchmarks. Just to underline this statement, you can study the
`collection of benchmarks <https://github.com/thiagofe/ulab_samples>`__.

Use an ``ndarray``, if you can
------------------------------

Many functions in ``ulab`` are implemented in a universal fashion,
meaning that both generic ``micropython`` iterables and ``ndarray``\ s
can be passed as arguments. E.g., both

.. code:: python

    from ulab import numpy as np

    np.sum([1, 2, 3, 4, 5])

and

.. code:: python

    from ulab import numpy as np

    a = np.array([1, 2, 3, 4, 5])
    np.sum(a)

will return the ``micropython`` variable 15 as the result. Still,
``np.sum(a)`` is evaluated significantly faster: in
``np.sum([1, 2, 3, 4, 5])``, the interpreter has to fetch 5
``micropython`` variables, convert them to ``float``, and sum the
values, while the C type of ``a`` is known, so the interpreter can
evaluate the ``sum`` in a single ``for`` loop. In that ``for`` loop
there are no function calls: the iteration simply walks through the
pointer holding the values of ``a``, and adds the values to an
accumulator. If the array ``a`` is already available, you can therefore
gain a factor of 3 in speed by calling ``sum`` on the array, instead of
on the list. Compared to a pure ``python`` implementation of the same
functionality, the speed-up is around 40 (again, this might depend on
the hardware).

On the other hand, if the array is not available, there is not much
point in converting the list to an ``ndarray`` and passing that to the
function. In fact, you should expect a slow-down: the constructor has to
iterate over the list elements, and convert them to a numerical type. On
top of that, it also has to reserve RAM for the ``ndarray``.
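If you want to check what the factor is on your particular board, a
minimal timing sketch along the following lines can be used. It assumes
a ``micropython`` port that provides ``time.ticks_us`` and
``time.ticks_diff``; the absolute numbers will, of course, depend on the
hardware.

.. code:: python

    # minimal benchmarking sketch; assumes time.ticks_us/ticks_diff exist on the port
    import time
    from ulab import numpy as np

    lst = list(range(100))
    arr = np.array(lst)

    t = time.ticks_us()
    np.sum(lst)          # generic iterable: elements are fetched one by one
    print('list:   ', time.ticks_diff(time.ticks_us(), t), 'us')

    t = time.ticks_us()
    np.sum(arr)          # ndarray: a single C loop walks the data pointer
    print('ndarray:', time.ticks_diff(time.ticks_us(), t), 'us')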
Use a reasonable ``dtype``
--------------------------

Just as in ``numpy``, the default ``dtype`` is ``float``. But this does
not mean that it is the most suitable one in all scenarios. If data are
streamed from an 8-bit ADC, and you only want to know the maximum, or
the sum, then it is quite reasonable to use ``uint8`` for the ``dtype``.
Storing the same data in a ``float`` array would cost 4 or 8 times as
much RAM, with absolutely no gain. Do not rely on the default value of
the constructor’s keyword argument, and choose one that fits!
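As an illustration, an 8-bit sample buffer can be wrapped in a ``uint8``
array directly; the sample values below are made up, and stand in for
whatever your ADC driver delivers.

.. code:: python

    from ulab import numpy as np

    # hypothetical 8-bit ADC readings, e.g., filled in by a driver
    samples = bytearray([17, 113, 250, 6, 233, 105, 28, 64])

    a = np.array(samples, dtype=np.uint8)   # one byte per entry
    # b = np.array(samples)                 # default dtype: 4 or 8 bytes per entry

    print(np.max(a), np.sum(a))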
Beware the axis!
----------------

Whenever ``ulab`` iterates over multi-dimensional arrays, the outermost
loop is the first axis, then the second axis, and so on. E.g., when the
``sum`` of

.. code:: python

    a = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12]], dtype=np.uint8)

is being calculated, first the data pointer walks along ``[1, 2, 3, 4]``
(innermost loop, last axis), then it is moved back to the position where
5 is stored (this is the nesting loop), and traverses ``[5, 6, 7, 8]``,
and so on. Moving the pointer back to 5 is more expensive than moving it
along an axis, because the position of 5 has to be calculated, whereas
moving from 5 to 6 is simply an addition to the address. Thus, while the
matrix

.. code:: python

    b = np.array([[1, 5, 9],
                  [2, 6, 10],
                  [3, 7, 11],
                  [4, 8, 12]], dtype=np.uint8)

holds the same data as ``a``, the summation over the entries in ``b`` is
slower, because the pointer has to be re-wound three times, as opposed
to twice in ``a``. For small matrices the savings are not significant,
but you would definitely notice the difference, if you had

::

    a = np.array(range(2000)).reshape((2, 1000))
    b = np.array(range(2000)).reshape((1000, 2))

The moral is that, in order to improve the execution speed, you should
try to make the last axis the longest, whenever possible. As a side
note, ``numpy`` can re-arrange its loops, and puts the longest axis in
the innermost loop. This is why the longest axis is sometimes referred
to as the fast axis. In ``ulab``, the order of the axes is fixed.

Reduce the number of artifacts
------------------------------

Before showing a real-life example, let us suppose that we want to
interpolate uniformly sampled data, and the absolute magnitude is not
really important; we only care about the ratios between neighbouring
values. One way of achieving this is calling the ``interp`` function.
However, we could just as well work with slices.

.. code::

    # code to be run in CPython

    a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
    b = np.zeros(9, dtype=np.uint8)

    b[::2] = 2 * a
    b[1::2] = a[:-1] + a[1:]

    b //= 2
    b



.. parsed-literal::

    array([ 0, 5, 10, 6, 2, 11, 20, 12, 4], dtype=uint8)



``b`` now has the values of ``a`` at every even index, and interpolated
values at every odd index. If only the relative magnitudes are
important, then we can even save the division by 2, and we end up with

.. code::

    # code to be run in CPython

    a = np.array([0, 10, 2, 20, 4], dtype=np.uint8)
    b = np.zeros(9, dtype=np.uint8)

    b[::2] = 2 * a
    b[1::2] = a[:-1] + a[1:]

    b



.. parsed-literal::

    array([ 0, 10, 20, 12, 4, 22, 40, 24, 8], dtype=uint8)



Importantly, we managed to keep the results in the smaller ``dtype``,
``uint8``. Now, while the two assignments above are terse and pythonic,
the code is not the most efficient: the right-hand sides are compound
expressions, generating intermediate results. To store them, RAM has to
be allocated. This takes time, and leads to memory fragmentation. It is
better to write out the assignments in four instructions:

.. code::

    # code to be run in CPython

    b = np.zeros(9, dtype=np.uint8)

    b[::2] = a
    b[::2] += a
    b[1::2] = a[:-1]
    b[1::2] += a[1:]

    b



.. parsed-literal::

    array([ 0, 10, 20, 12, 4, 22, 40, 24, 8], dtype=uint8)



The results are the same, but no extra RAM is allocated, except for the
views ``a[:-1]`` and ``a[1:]``; but those had to be created even in the
original implementation.

Upscaling images
~~~~~~~~~~~~~~~~

And now the example: there are low-resolution thermal cameras out there.
Low resolution might mean 8 by 8 pixels. Such a small number of pixels
is just not reasonable to plot, no matter how small the display is. If
you want to make the camera image a bit more pleasing, you can upscale
(stretch) it in both dimensions. This can be done exactly as we
up-scaled the linear array:

.. code::

    # code to be run in CPython

    # a is the 8-by-8 camera image
    b = np.zeros((15, 15), dtype=np.uint8)

    b[::2, ::2] = a
    b[1::2, ::2] = a[:-1, :]
    b[1::2, ::2] += a[1:, :]
    b[1::2, ::2] //= 2
    b[:, 1::2] = b[:, :-1:2]
    b[:, 1::2] += b[:, 2::2]
    b[:, 1::2] //= 2

Up-scaling by larger factors can be done in a similar fashion; you
simply have more assignments.

There are cases when one cannot do away with the intermediate results.
Two prominent cases are the ``where`` function, and indexing by means of
a Boolean array. E.g., in

.. code::

    # code to be run in CPython

    a = np.array([1, 2, 3, 4, 5])
    b = a[a < 4]
    b



.. parsed-literal::

    array([1, 2, 3])



the expression ``a < 4`` produces the Boolean array

.. code::

    # code to be run in CPython

    a < 4



.. parsed-literal::

    array([ True, True, True, False, False])



If you repeatedly have such conditions in a loop, you might have to
periodically call the garbage collector to remove the Boolean arrays
that are used only once.
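A minimal sketch of that pattern is shown below; the loop body and the
collection period are placeholders, and the snippet assumes that the
``gc`` module is available on the port.

.. code:: python

    import gc
    from ulab import numpy as np

    a = np.array([1, 2, 3, 4, 5])

    for i in range(1000):
        b = a[a < 4]          # a < 4 creates a throw-away Boolean array
        # ... work with b ...
        if i % 100 == 99:     # collect every now and then, not on every pass
            gc.collect()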
