Programming ulab
================

Earlier we saw how ``ulab``\ ’s functions and methods can be
accessed in ``micropython``. This last section of the book explains how
these functions are implemented. By the end of this chapter, you should
not only be able to extend ``ulab`` and write your own
``numpy``-compatible functions, but, through a deeper understanding of
the inner workings of the functions, you should also be able to see what
the trade-offs are at the ``python`` level.

Code organisation
-----------------

As mentioned earlier, the ``python`` functions are organised into
sub-modules at the C level. The C sub-modules can be found in
``./ulab/code/``.

The ``ndarray`` object
----------------------

General comments
~~~~~~~~~~~~~~~~

``ndarrays`` are efficient containers of numerical data of the same type
(i.e., signed/unsigned chars, signed/unsigned integers or
``mp_float_t``\ s, which, depending on the platform, are either C
``float``\ s, or C ``double``\ s). Beyond the ``base`` type, and the
actual data, which is stored in the void pointer ``*array``, the type
definition has seven additional members: the ``dtype``, which tells us
how the bytes are to be interpreted; the ``itemsize``, which stores the
size of a single entry in the array; ``boolean``, an unsigned integer,
which determines whether the array is to be treated as a set of
Booleans, or as numerical data; ``ndim``, the number of dimensions
(``uint8_t``); ``len``, the length of the array (the number of entries);
the ``shape`` (an array of ``size_t``); and the ``strides`` (an array of
``int32_t``). The length is simply the product of the numbers in
``shape``.

The type definition is as follows:

.. code:: c

   typedef struct _ndarray_obj_t {
       mp_obj_base_t base;
       uint8_t dtype;
       uint8_t itemsize;
       uint8_t boolean;
       uint8_t ndim;
       size_t len;
       size_t shape[ULAB_MAX_DIMS];
       int32_t strides[ULAB_MAX_DIMS];
       void *array;
   } ndarray_obj_t;
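
As a simple illustration of the relationship between ``len`` and
``shape``, the hypothetical helper below (``ndarray_length`` is not part
of ``ulab``) computes the product of the shape entries; it assumes the
right-aligned ``shape`` convention discussed later in this chapter:

.. code:: c

   // sketch only: len is the product of the shape entries of the axes
   // that are actually in use (the trailing ndim entries of shape)
   size_t ndarray_length(ndarray_obj_t *ndarray) {
       size_t len = 1;
       for(uint8_t i = ULAB_MAX_DIMS - ndarray->ndim; i < ULAB_MAX_DIMS; i++) {
           len *= ndarray->shape[i];
       }
       return len;
   }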

Memory layout
~~~~~~~~~~~~~

The values of an ``ndarray`` are stored in a contiguous segment in the
RAM. The ``ndarray`` can be dense, meaning that all numbers in the
linear memory segment belong to a linear combination of coordinates, and
it can also be sparse, i.e., some elements of the linear storage space
are skipped when the elements of the tensor are traversed.

In the RAM, the position of the item
:math:`M(n_1, n_2, ..., n_{k-1}, n_k)` in a dense tensor of rank
:math:`k` is given by the linear combination

:raw-latex:`\begin{equation}
P(n_1, n_2, ..., n_{k-1}, n_k) = n_1 s_1 + n_2 s_2 + ... + n_{k-1}s_{k-1} + n_ks_k = \sum_{i=1}^{k}n_is_i
\end{equation}` where :math:`s_i` are the strides of the tensor, defined
as

:raw-latex:`\begin{equation}
s_i = \prod_{j=i+1}^k l_j
\end{equation}`

where :math:`l_j` is the length of the tensor along the :math:`j`\ th axis.
When the tensor is sparse (e.g., when the tensor is sliced), the strides
along a particular axis will be multiplied by a non-zero integer. If
this integer is different from :math:`\pm 1`, the linear combination above
cannot access all elements in the RAM, i.e., some numbers will be
skipped. Note that :math:`|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|` holds
even if the tensor is sparse. The statement is trivial for dense tensors,
where it follows from the definition of :math:`s_i`, since
:math:`s_i/s_{i+1} = l_i`. It remains true for sparse tensors, because a
slice cannot have a step larger than the length of the shape along that axis.
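
As an illustration of the linear combination above, the hypothetical
helper below (not part of ``ulab``) computes the position
:math:`P(n_1, n_2, ..., n_k)` of an element from its coordinates and the
strides of the tensor:

.. code:: c

   #include <stdint.h>
   #include <stddef.h>

   // hypothetical helper, not part of ulab: returns the position
   // P(n_1, ..., n_k) = n_1*s_1 + ... + n_k*s_k of an element,
   // given its coordinates and the strides of the tensor
   int32_t tensor_position(const size_t *coords, const int32_t *strides, uint8_t ndim) {
       int32_t position = 0;
       for(uint8_t i = 0; i < ndim; i++) {
           position += (int32_t)coords[i] * strides[i];
       }
       return position;
   }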

When creating a *view*, we simply re-calculate the ``strides``, and
re-set the ``*array`` pointer.

Iterating over elements of a tensor
-----------------------------------

The ``shape`` and ``strides`` members of the array tell us how we have
to move our pointer when we want to read out the numbers. For technical
reasons that will become clear later, the numbers in ``shape`` and in
``strides`` are aligned to the right, i.e., they begin on the right hand
side: if the number of possible dimensions is ``ULAB_MAX_DIMS``, then
``shape[ULAB_MAX_DIMS-1]`` is the length of the last axis,
``shape[ULAB_MAX_DIMS-2]`` is the length of the last but one axis, and
so on. If the number of actual dimensions is smaller, i.e.,
``ndim < ULAB_MAX_DIMS``, the first ``ULAB_MAX_DIMS - ndim`` entries in
``shape`` and ``strides`` will be equal to zero, but they could, in fact,
be assigned any value, because they are never accessed in an operation.
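
For example, assuming ``ULAB_MAX_DIMS`` is 4 (the values below are
purely illustrative), a dense, two-dimensional ``uint8`` array of shape
``(3, 5)`` would be described as

.. code:: c

   // illustrative values, assuming ULAB_MAX_DIMS == 4:
   // a dense, two-dimensional uint8 array of shape (3, 5)
   size_t shape[ULAB_MAX_DIMS] = {0, 0, 3, 5};     // right-aligned; the first two entries are unused
   int32_t strides[ULAB_MAX_DIMS] = {0, 0, 5, 1};  // in bytes; the itemsize of uint8 is 1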

With this definition of the strides, the linear combination
:math:`P(n_1, n_2, ..., n_{k-1}, n_k)` is a one-to-one mapping between the
space of tensor coordinates, :math:`(n_1, n_2, ..., n_{k-1}, n_k)`, and
the positions in the linear array,
:math:`n_1s_1 + n_2s_2 + ... + n_{k-1}s_{k-1} + n_ks_k`, i.e., no two
distinct sets of coordinates will result in the same position in the
linear array.

Since the ``strides`` are given in terms of bytes, when we iterate over
an array, the void data pointer is usually cast to ``uint8_t``, and the
values are converted using the proper data type stored in
``ndarray->dtype``. However, there might be cases when it makes perfect
sense to cast ``*array`` to a different type, in which case the
``strides`` have to be re-scaled by the value of ``ndarray->itemsize``.
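
As an example of such a conversion, the value at the current pointer
position could be read out along the lines of the sketch below (a
simplified stand-in, not the helper that ``ulab`` itself uses):

.. code:: c

   // sketch only: convert the bytes at the data pointer to mp_float_t,
   // based on the dtype of the array
   static mp_float_t read_as_float(uint8_t *p, uint8_t dtype) {
       if(dtype == NDARRAY_UINT8) {
           return (mp_float_t)(*(uint8_t *)p);
       } else if(dtype == NDARRAY_INT8) {
           return (mp_float_t)(*(int8_t *)p);
       } else if(dtype == NDARRAY_UINT16) {
           return (mp_float_t)(*(uint16_t *)p);
       } else if(dtype == NDARRAY_INT16) {
           return (mp_float_t)(*(int16_t *)p);
       }
       return *(mp_float_t *)p; // NDARRAY_FLOAT
   }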

Iterating using the unwrapped loops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following macro definition is taken from
`vector.h <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.h>`__,
and demonstrates how we can iterate over a single array in four
dimensions.

.. code:: c

   #define ITERATE_VECTOR(type, array, source, sarray) do {
       size_t i=0;
       do {
           size_t j = 0;
           do {
               size_t k = 0;
               do {
                   size_t l = 0;
                   do {
                       *(array)++ = f(*((type *)(sarray)));
                       (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];
                       l++;
                   } while(l < (source)->shape[ULAB_MAX_DIMS-1]);
                   (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];
                   (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];
                   k++;
               } while(k < (source)->shape[ULAB_MAX_DIMS-2]);
               (sarray) -= (source)->strides[ULAB_MAX_DIMS - 2] * (source)->shape[ULAB_MAX_DIMS-2];
               (sarray) += (source)->strides[ULAB_MAX_DIMS - 3];
               j++;
           } while(j < (source)->shape[ULAB_MAX_DIMS-3]);
           (sarray) -= (source)->strides[ULAB_MAX_DIMS - 3] * (source)->shape[ULAB_MAX_DIMS-3];
           (sarray) += (source)->strides[ULAB_MAX_DIMS - 4];
           i++;
       } while(i < (source)->shape[ULAB_MAX_DIMS-4]);
   } while(0)

We start with the innermost loop, the one cycling over ``l``. ``array`` is
already of type ``mp_float_t``, while the source array, ``sarray``, has
been cast to ``uint8_t`` in the calling function. The numbers contained
in ``sarray`` have to be read out in the proper type dictated by
``ndarray->dtype``. This is what happens in the statement
``*((type *)(sarray))``, and this number is then fed into the function
``f``. Vectorised mathematical functions produce *dense* arrays, and for
this reason, we can simply advance the ``array`` pointer.

Advancing the ``sarray`` pointer is a bit more involved: first, in the
innermost loop, we simply move forward by the amount given by the
last stride, which is ``(source)->strides[ULAB_MAX_DIMS - 1]``, because
the ``shape`` and the ``strides`` are aligned to the right. We move the
pointer as many times as given by ``(source)->shape[ULAB_MAX_DIMS-1]``,
which is the length of the very last axis. Hence the structure of
the loop

.. code:: c

       size_t l = 0;
       do {
           ...
           l++;
       } while(l < (source)->shape[ULAB_MAX_DIMS-1]);

Once we have exhausted the last axis, we have to re-wind the pointer,
and advance it by an amount given by the last but one stride. Keep in
mind that in the innermost loop we moved our pointer
``(source)->shape[ULAB_MAX_DIMS-1]`` times by
``(source)->strides[ULAB_MAX_DIMS - 1]``, i.e., we re-wind it by moving
it backwards by
``(source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1]``.
In the next step, we move forward by
``(source)->strides[ULAB_MAX_DIMS - 2]``, which is the last but one
stride.

.. code:: c

       (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];
       (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];

This pattern must be repeated for each axis of the array, and this is
how we arrive at the four nested loops listed above.
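
For reference, the same pattern reduced to two dimensions would look
like the sketch below (``source`` and ``sarray`` have the same meaning
as in the macro above):

.. code:: c

   // sketch: the same pointer book-keeping, reduced to two dimensions
   size_t k = 0;
   do {
       size_t l = 0;
       do {
           // process *((type *)(sarray)) here
           sarray += source->strides[ULAB_MAX_DIMS - 1];
           l++;
       } while(l < source->shape[ULAB_MAX_DIMS - 1]);
       // re-wind the last axis, and step along the last but one axis
       sarray -= source->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS - 1];
       sarray += source->strides[ULAB_MAX_DIMS - 2];
       k++;
   } while(k < source->shape[ULAB_MAX_DIMS - 2]);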

Re-winding arrays by means of a function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to un-wrapping the iteration loops by means of macros, there
is another way of traversing all elements of a tensor: we note that,
since :math:`|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|`,
:math:`P(n_1, n_2, ..., n_{k-1}, n_k)` changes most slowly in the last
coordinate. Hence, if we start from the very beginning (:math:`n_i = 0`
for all :math:`i`), and walk along the linear RAM segment, we increment
the value of :math:`n_k` as long as :math:`n_k < l_k`. Once
:math:`n_k = l_k`, we have to reset :math:`n_k` to 0, and increment
:math:`n_{k-1}` by one. After each such round, :math:`n_{k-1}` will be
incremented by one, as long as :math:`n_{k-1} < l_{k-1}`. Once
:math:`n_{k-1} = l_{k-1}`, we reset both :math:`n_k` and
:math:`n_{k-1}` to 0, and increment :math:`n_{k-2}` by one.

Rewinding the arrays in this way is implemented in the function
``ndarray_rewind_array`` in
`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__.

.. code:: c

   void ndarray_rewind_array(uint8_t ndim, uint8_t *array, size_t *shape, int32_t *strides, size_t *coords) {
       // resets the data pointer of a single array, whenever an axis is full
       // since we always iterate over the very last axis, we have to keep track of
       // the last ndim-2 axes only
       array -= shape[ULAB_MAX_DIMS - 1] * strides[ULAB_MAX_DIMS - 1];
       array += strides[ULAB_MAX_DIMS - 2];
       for(uint8_t i=1; i < ndim-1; i++) {
           coords[ULAB_MAX_DIMS - 1 - i] += 1;
           if(coords[ULAB_MAX_DIMS - 1 - i] == shape[ULAB_MAX_DIMS - 1 - i]) { // we are at a dimension boundary
               array -= shape[ULAB_MAX_DIMS - 1 - i] * strides[ULAB_MAX_DIMS - 1 - i];
               array += strides[ULAB_MAX_DIMS - 2 - i];
               coords[ULAB_MAX_DIMS - 1 - i] = 0;
               coords[ULAB_MAX_DIMS - 2 - i] += 1;
           } else { // coordinates can change only, if the last coordinate changes
               return;
           }
       }
   }

The function would be called as in the snippet below. Note that the
innermost loop is factored out, so that we can save the ``if(...)``
statement for the last axis.

.. code:: c

       size_t *coords = ndarray_new_coords(results->ndim);
       for(size_t i=0; i < results->len/results->shape[ULAB_MAX_DIMS -1]; i++) {
           size_t l = 0;
           do {
               ...
               l++;
           } while(l < results->shape[ULAB_MAX_DIMS - 1]);
           ndarray_rewind_array(results->ndim, array, results->shape, strides, coords);
       }

The advantage of this method is that the implementation is independent
of the number of dimensions: the iteration requires more or less the
same flash space for 2 dimensions as for 22. However, the price we have
to pay for this convenience is the extra function call.

Iterating over two ndarrays simultaneously: broadcasting
--------------------------------------------------------

Whenever we invoke a binary operator, call a function with two arguments
of ``ndarray`` type, or assign something to an ``ndarray``, we have to
iterate over two views at the same time. The task is trivial if the two
``ndarray``\ s in question have the same shape (but not necessarily the
same set of strides), because in this case we can still iterate in the
same loop. All that happens is that we move two data pointers in sync.

The problem becomes a bit more involved when the shapes of the two
``ndarray``\ s are not identical. For such cases, ``numpy`` defines
so-called broadcasting, which boils down to two rules.

1. The shape of the tensor with the lower rank has to be prepended with
   axes of size 1 until the two ranks become equal.
2. Along each axis, the two tensors must have the same size, or one of
   the sizes must be 1.

If, after applying the first rule, the second is not satisfied, the two
``ndarray``\ s cannot be broadcast together.

Now, let us suppose that we have two compatible ``ndarray``\ s, i.e.,
after applying the first rule, the second is satisfied. How do we
iterate over the elements in the tensors?

We should recall what exactly we do when iterating over a single
array: normally, we move the data pointer by the last stride, except
when we arrive at a dimension boundary (when the last axis is
exhausted). At that point, we move the pointer by an amount dictated by
the strides. And this is the key: *dictated by the strides*. Now, if we
have two arrays whose shapes are originally not compatible, we define new
strides for them, and use these in the iteration. With that, we are back
to the case of two compatible arrays.

Now, let us look at the second broadcasting rule: if the two arrays have
the same size along an axis, we take both ``ndarray``\ s’ strides along
that axis. If, on the other hand, one of the ``ndarray``\ s is of length
1 along one of its axes, we set the corresponding stride to 0. This
ensures that the corresponding data pointer is not moved, when we iterate
over both ``ndarray``\ s at the same time. For example, when broadcasting
an array of shape (3, 1) with one of shape (3, 4), the stride of the
first array along the last axis is set to 0, so that its single column
is re-used four times.

Thus, in order to implement broadcasting, we first have to check
whether the two above-mentioned rules can be satisfied, and if so, we
have to find the two new sets of strides.

The ``ndarray_can_broadcast`` function from
`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__
takes two ``ndarray``\ s, and returns ``true`` if the two arrays can be
broadcast together. At the same time, it also calculates new strides for
the two arrays, so that they can be iterated over at the same time.

.. code:: c

   bool ndarray_can_broadcast(ndarray_obj_t *lhs, ndarray_obj_t *rhs, uint8_t *ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
       // returns True or False, depending on, whether the two arrays can be broadcast together
       // numpy's broadcasting rules are as follows:
       //
       // 1. the two shapes are either equal
       // 2. one of the shapes is 1
       memset(lstrides, 0, sizeof(size_t)*ULAB_MAX_DIMS);
       memset(rstrides, 0, sizeof(size_t)*ULAB_MAX_DIMS);
       lstrides[ULAB_MAX_DIMS - 1] = lhs->strides[ULAB_MAX_DIMS - 1];
       rstrides[ULAB_MAX_DIMS - 1] = rhs->strides[ULAB_MAX_DIMS - 1];
       for(uint8_t i=ULAB_MAX_DIMS; i > 0; i--) {
           if((lhs->shape[i-1] == rhs->shape[i-1]) || (lhs->shape[i-1] == 0) || (lhs->shape[i-1] == 1) ||
           (rhs->shape[i-1] == 0) || (rhs->shape[i-1] == 1)) {
               shape[i-1] = MAX(lhs->shape[i-1], rhs->shape[i-1]);
               if(shape[i-1] > 0) (*ndim)++;
               if(lhs->shape[i-1] < 2) {
                   lstrides[i-1] = 0;
               } else {
                   lstrides[i-1] = lhs->strides[i-1];
               }
               if(rhs->shape[i-1] < 2) {
                   rstrides[i-1] = 0;
               } else {
                   rstrides[i-1] = rhs->strides[i-1];
               }
           } else {
               return false;
           }
       }
       return true;
   }

A good example of how the function would be called can be found in
`vector.c <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.c>`__,
in the ``vectorise_arctan2`` function:

.. code:: c

   mp_obj_t vectorise_arctan2(mp_obj_t y, mp_obj_t x) {
       ...
       uint8_t ndim = 0;
       size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
       int32_t *xstrides = m_new(int32_t, ULAB_MAX_DIMS);
       int32_t *ystrides = m_new(int32_t, ULAB_MAX_DIMS);
       if(!ndarray_can_broadcast(ndarray_x, ndarray_y, &ndim, shape, xstrides, ystrides)) {
           mp_raise_ValueError(translate("operands could not be broadcast together"));
           m_del(size_t, shape, ULAB_MAX_DIMS);
           m_del(int32_t, xstrides, ULAB_MAX_DIMS);
           m_del(int32_t, ystrides, ULAB_MAX_DIMS);
       }

       uint8_t *xarray = (uint8_t *)ndarray_x->array;
       uint8_t *yarray = (uint8_t *)ndarray_y->array;
       
       ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
       mp_float_t *rarray = (mp_float_t *)results->array;
       ...

After the new strides have been calculated, the iteration loop is
identical to what we discussed in the previous section.
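
As a rough sketch (not the actual ``ulab`` source; the pointer names are
made up, and float operands are assumed for simplicity), the innermost
loop of such a broadcast operation advances the two source pointers by
their respective re-calculated strides:

.. code:: c

   // sketch: innermost loop of a broadcast binary operation; larray and rarray
   // are the (byte) data pointers of the two operands, rarray_out points to the
   // dense result, and lstrides/rstrides are the re-calculated strides
   size_t l = 0;
   do {
       mp_float_t x = *(mp_float_t *)larray;
       mp_float_t y = *(mp_float_t *)rarray;
       *rarray_out++ = x + y; // e.g., element-wise addition
       larray += lstrides[ULAB_MAX_DIMS - 1];
       rarray += rstrides[ULAB_MAX_DIMS - 1];
       l++;
   } while(l < shape[ULAB_MAX_DIMS - 1]);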

Contracting an ``ndarray``
--------------------------

There are many operations that reduce the number of dimensions of an
``ndarray`` by 1, i.e., that remove an axis from the tensor. The drill
is the same as before, with the exception that first we have to remove
the ``strides`` and ``shape`` entries that correspond to the axis along
which we intend to contract. The ``numerical_reduce_axes`` function from
`numerical.c <https://github.com/v923z/micropython-ulab/blob/master/code/numerical/numerical.c>`__
does that.

.. code:: c

   static void numerical_reduce_axes(ndarray_obj_t *ndarray, int8_t axis, size_t *shape, int32_t *strides) {
       // removes the values corresponding to a single axis from the shape and strides array
       uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + axis;
       if((ndarray->ndim == 1) && (axis == 0)) {
           index = 0;
           shape[ULAB_MAX_DIMS - 1] = 0;
           return;
       }
       for(uint8_t i = ULAB_MAX_DIMS - 1; i > 0; i--) {
           if(i > index) {
               shape[i] = ndarray->shape[i];
               strides[i] = ndarray->strides[i];
           } else {
               shape[i] = ndarray->shape[i-1];
               strides[i] = ndarray->strides[i-1];
           }
       }
   }

Once the reduced ``strides`` and ``shape`` are known, we place the axis
in question in the innermost loop, and wrap it with the loops whose
coordinates are in the ``strides`` and ``shape`` arrays. The
``RUN_STD`` macro from
`numerical.h <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/numerical/numerical.h>`__
is a good example. The macro is expanded in the
``numerical_sum_mean_std_ndarray`` function.

.. code:: c

   static mp_obj_t numerical_sum_mean_std_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, uint8_t optype, size_t ddof) {
       uint8_t *array = (uint8_t *)ndarray->array;
       size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
       memset(shape, 0, sizeof(size_t)*ULAB_MAX_DIMS);
       int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
       memset(strides, 0, sizeof(uint32_t)*ULAB_MAX_DIMS);

       int8_t ax = mp_obj_get_int(axis);
       if(ax < 0) ax += ndarray->ndim;
       if((ax < 0) || (ax > ndarray->ndim - 1)) {
           mp_raise_ValueError(translate("index out of range"));
       }
       numerical_reduce_axes(ndarray, ax, shape, strides);
       uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + ax;
       ndarray_obj_t *results = NULL;
       uint8_t *rarray = NULL;
       ...

Here is the macro for the three-dimensional case:

.. code:: c

   #define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {
       size_t k = 0;
       do {
           size_t l = 0;
           do {
               RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));
               (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];
               (array) += (strides)[ULAB_MAX_DIMS - 1];
               l++;
           } while(l < (shape)[ULAB_MAX_DIMS - 1]);
           (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];
           (array) += (strides)[ULAB_MAX_DIMS - 3];
           k++;
       } while(k < (shape)[ULAB_MAX_DIMS - 2]);
   } while(0)

In ``RUN_STD``, we simply move our pointers; the calculation itself
happens in the ``RUN_STD1`` macro below. (Note that this is the
implementation of the numerically stable Welford algorithm.)

.. code:: c

   #define RUN_STD1(ndarray, type, array, results, r, index, div)
   ({
       mp_float_t M, m, S = 0.0, s = 0.0;
       M = m = *(mp_float_t *)((type *)(array));
       for(size_t i=1; i < (ndarray)->shape[(index)]; i++) {
           (array) += (ndarray)->strides[(index)];
           mp_float_t value = *(mp_float_t *)((type *)(array));
           m = M + (value - M) / (mp_float_t)i;
           s = S + (value - M) * (value - m);
           M = m;
           S = s;
       }
       (array) += (ndarray)->strides[(index)];
       *(r)++ = MICROPY_FLOAT_C_FUN(sqrt)((ndarray)->shape[(index)] * s / (div));
   })

Upcasting
---------

When the ``dtype``\ s of the two arrays in an operation are different, the
result’s ``dtype`` will be decided by the following upcasting rules:

1. Operations with two ``ndarray``\ s of the same ``dtype`` preserve
   their ``dtype``, even when the results overflow.

2. If either of the operands is a float, the result automatically
   becomes a float.

3. Otherwise:

   -  ``uint8`` + ``int8`` => ``int16``,

   -  ``uint8`` + ``int16`` => ``int16``

   -  ``uint8`` + ``uint16`` => ``uint16``

   -  ``int8`` + ``int16`` => ``int16``

   -  ``int8`` + ``uint16`` => ``uint16`` (in numpy, the result is an
      ``int32``)

   -  ``uint16`` + ``int16`` => ``float`` (in numpy, the result is an
      ``int32``)

4. When one operand of a binary operation is a generic scalar
   ``micropython`` variable, i.e., ``mp_obj_int``, or ``mp_obj_float``,
   it will be converted to a linear array of length 1, with the
   smallest ``dtype`` that can accommodate the variable in question
   (e.g., the scalar 5 becomes a ``uint8`` array of length 1).
   After that, the broadcasting rules apply, as described in the section
   `Iterating over two ndarrays simultaneously:
   broadcasting <#Iterating_over_two_ndarrays_simultaneously:_broadcasting>`__.

Upcasting is resolved in place, wherever it is required. Notable
examples can be found in
`ndarray_operators.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray_operators.c>`__.

Slicing and indexing
--------------------

An ``ndarray`` can be indexed with three types of objects: integer
scalars, slices, and another ``ndarray``, whose elements are either
integer scalars, or Booleans. Since slice and integer indices can be
thought of as modifications of the ``strides``, these indices return a
view of the ``ndarray``. This statement does not hold for ``ndarray``
indices, which therefore return a copy of the array.

Extending ulab
--------------

The ``user`` module is disabled by default, as can be seen from the last
couple of lines of
`ulab.h <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h>`__

.. code:: c

   // user-defined module
   #ifndef ULAB_USER_MODULE
   #define ULAB_USER_MODULE                (0)
   #endif

The module contains a very simple function, ``user_dummy``, and this
function is bound to the module itself. In other words, even if the
module is enabled, one has to ``import`` it explicitly:

.. code:: python


   import ulab
   from ulab import user

   user.dummy_function(2.5)

which should just return 5.0. Even if ``numpy``-compatibility is
required (i.e., if most functions are bound at the top level to ``ulab``
directly), keeping user-defined code in a separate module has a great
advantage: only the
`user.h <https://github.com/v923z/micropython-ulab/blob/master/code/user/user.h>`__
and
`user.c <https://github.com/v923z/micropython-ulab/blob/master/code/user/user.c>`__
files have to be modified, thus it should be relatively straightforward
to update your local copy from
`github <https://github.com/v923z/micropython-ulab/blob/master/>`__.

Now, let us see how we can add a more meaningful function.

Creating a new ndarray
----------------------

In the `General comments <#General_comments>`__ section we have seen
the type definition of an ``ndarray``. This structure can be generated
by means of a couple of functions listed in
`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__.

ndarray_new_ndarray
~~~~~~~~~~~~~~~~~~~

The ``ndarray_new_ndarray`` function is called by all other
array-generating functions. It takes the number of dimensions, ``ndim``,
a ``uint8_t``, the ``shape``, a pointer to ``size_t``, the ``strides``,
a pointer to ``int32_t``, and ``dtype``, another ``uint8_t``, as its
arguments, and returns a new array with all entries initialised to 0.

Assuming that ``ULAB_MAX_DIMS > 2``, a new array of dimension 3,
of ``shape`` (3, 4, 5), of ``strides`` (1000, 200, 10), and ``dtype``
``uint16_t`` can be generated by the following instructions

.. code:: c

   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
   shape[ULAB_MAX_DIMS - 1] = 5;
   shape[ULAB_MAX_DIMS - 2] = 4;
   shape[ULAB_MAX_DIMS - 3] = 3;

   int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
   strides[ULAB_MAX_DIMS - 1] = 10;
   strides[ULAB_MAX_DIMS - 2] = 200;
   strides[ULAB_MAX_DIMS - 3] = 1000;

   ndarray_obj_t *new_ndarray = ndarray_new_ndarray(3, shape, strides, NDARRAY_UINT16);

ndarray_new_dense_ndarray
~~~~~~~~~~~~~~~~~~~~~~~~~

The function simply calculates the ``strides`` from the ``shape``, and
calls ``ndarray_new_ndarray``. Assuming that ``ULAB_MAX_DIMS > 2``, a
new dense array of dimension 3, of ``shape`` (3, 4, 5), and ``dtype``
``mp_float_t`` can be generated by the following instructions

.. code:: c

   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
   shape[ULAB_MAX_DIMS - 1] = 5;
   shape[ULAB_MAX_DIMS - 2] = 4;
   shape[ULAB_MAX_DIMS - 3] = 3;

   ndarray_obj_t *new_ndarray = ndarray_new_dense_ndarray(3, shape, NDARRAY_FLOAT);

ndarray_new_linear_array
~~~~~~~~~~~~~~~~~~~~~~~~

Since the number of dimensions of a linear array is known (it is 1), the
``ndarray_new_linear_array`` function takes the ``length``, a ``size_t``,
and the ``dtype``, a ``uint8_t``. Internally, ``ndarray_new_linear_array``
generates the ``shape`` array, and calls ``ndarray_new_dense_ndarray``
with ``ndim = 1``.

A linear array of length 100, and ``dtype`` ``uint8`` could be created
by the function call

.. code:: c

   ndarray_obj_t *new_ndarray = ndarray_new_linear_array(100, NDARRAY_UINT8);

ndarray_new_ndarray_from_tuple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This function takes a ``tuple``, which should hold the lengths of the
axes (in other words, the ``shape``), and the ``dtype``, and internally
calls ``ndarray_new_dense_ndarray``. A new ``ndarray`` can be
generated by calling

.. code:: c

   ndarray_obj_t *new_ndarray = ndarray_new_ndarray_from_tuple(shape, NDARRAY_FLOAT);

where ``shape`` is a tuple.

ndarray_new_view
~~~~~~~~~~~~~~~~

This function creates a *view*, and takes the source, an ``ndarray``, the
number of dimensions, a ``uint8_t``, the ``shape``, a pointer to
``size_t``, the ``strides``, a pointer to ``int32_t``, and the offset,
an ``int32_t``, as arguments. The offset is the number of bytes by which
the void ``array`` pointer is shifted. E.g., the ``python`` statement

.. code:: python

   a = np.array([0, 1, 2, 3, 4, 5], dtype=uint8)
   b = a[1::2]

produces the array

.. code:: python

   array([1, 3, 5], dtype=uint8)

which holds its data at position ``x0 + 1``, if ``a``\ ’s pointer is at
``x0``. In this particular case, the offset is 1.

The array ``b`` from the example above could be generated as

.. code:: c

   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
   shape[ULAB_MAX_DIMS - 1] = 3;

   int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
   strides[ULAB_MAX_DIMS - 1] = 2;

   int32_t offset = 1;
   uint8_t ndim = 1;

   ndarray_obj_t *new_ndarray = ndarray_new_view(ndarray_a, ndim, shape, strides, offset);

ndarray_copy_array
~~~~~~~~~~~~~~~~~~

The ``ndarray_copy_array`` function can be used for copying the contents
of an array. Note that the target array has to be created beforehand.
E.g., a one-to-one copy can be obtained by

.. code:: c

   ndarray_obj_t *new_ndarray = ndarray_new_ndarray(source->ndim, source->shape, source->strides, source->dtype);
   ndarray_copy_array(source, new_ndarray);

Note that the function cannot be used for forcing type conversion, i.e.,
the input and output types must be identical, because the function
simply calls the ``memcpy`` function. On the other hand, the input and
output ``strides`` do not necessarily have to be equal.

ndarray_copy_view
~~~~~~~~~~~~~~~~~

The ``ndarray_new_ndarray(...)`` instruction above can be saved by
calling the ``ndarray_copy_view`` function with the single ``source``
argument.
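
In other words (assuming, as above, that ``source`` points to an
existing ``ndarray``), a copy could be obtained with a single call along
the lines of

.. code:: c

   // one-step copy: ndarray_copy_view creates the target array, and copies the data
   ndarray_obj_t *new_ndarray = ndarray_copy_view(source);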

Accessing data in the ndarray
-----------------------------

Having seen how arrays can be generated and copied, it is time to look
at how the data in an ``ndarray`` can be accessed and modified.

For starters, let us suppose that the object in question comes from the
user (i.e., via the ``micropython`` interface). First, we have to
acquire a pointer to the ``ndarray`` by calling

.. code:: c

   ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(object_in);

If it is not clear whether the object is an ``ndarray`` (e.g., if we
want to write a function that can take ``ndarray``\ s and other
iterables as its argument), we can find this out by evaluating

.. code:: c

   mp_obj_is_type(object_in, &ulab_ndarray_type)

which should return ``true``. Once the pointer is at our disposal, we
can get a pointer to the underlying numerical array as discussed
earlier, i.e.,

.. code:: c

   uint8_t *array = (uint8_t *)ndarray->array;

If you need to find out the ``dtype`` of the array, you can get it by
accessing the ``dtype`` member of the ``ndarray``, i.e.,

.. code:: c

   ndarray->dtype

should be equal to ``B``, ``b``, ``H``, ``h``, ``f``, or ``d``. The size
of a single item is stored in the ``itemsize`` member. This number should
be equal to 1 if the ``dtype`` is ``B`` or ``b``, 2 if the ``dtype`` is
``H`` or ``h``, 4 if the ``dtype`` is ``f``, and 8 if it is ``d``.
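
Putting these pieces together, a typical prologue of a function that
accepts an ``ndarray`` might look like the following sketch (it mirrors
the ``user_square`` example discussed below):

.. code:: c

   // sketch of a typical prologue for a function that takes an ndarray
   if(!mp_obj_is_type(object_in, &ulab_ndarray_type)) {
       mp_raise_TypeError(translate("input must be an ndarray"));
   }
   ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(object_in);
   uint8_t *array = (uint8_t *)ndarray->array;
   uint8_t itemsize = ndarray->itemsize; // 1, 2, 4, or 8 bytes, depending on the dtype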

Boilerplate
-----------

In the next section, we will construct a function that generates the
element-wise square of a dense array, and otherwise raises a ``TypeError``
exception. Dense arrays can easily be iterated over, since we do not
have to care about the ``shape`` and the ``strides``. If the array is
sparse, the section `Iterating over elements of a
tensor <#Iterating-over-elements-of-a-tensor>`__ should contain hints as
to how the iteration can be implemented.

The function is listed under
`user.c <https://github.com/v923z/micropython-ulab/tree/master/code/user/>`__.
The ``user`` module is bound to ``ulab`` in
`ulab.c <https://github.com/v923z/micropython-ulab/tree/master/code/ulab.c>`__
in the lines

.. code:: c

       #if ULAB_USER_MODULE
           { MP_ROM_QSTR(MP_QSTR_user), MP_ROM_PTR(&ulab_user_module) },
       #endif

which assumes that at the very end of
`ulab.h <https://github.com/v923z/micropython-ulab/tree/master/code/ulab.h>`__
the

.. code:: c

   // user-defined module
   #ifndef ULAB_USER_MODULE
   #define ULAB_USER_MODULE                (1)
   #endif

constant has been set to 1. After compilation, you can call a particular
``user`` function in ``python`` by importing the module first, i.e.,

.. code:: python

   from ulab import numpy as np
   from ulab import user

   user.some_function(...)

This separation of user-defined functions from the rest of the code
ensures that the integrity of the main module and all its functions is
always preserved. Even in case of a catastrophic failure, you can
exclude the ``user`` module, and start over.

And now the function:

.. code:: c

   static mp_obj_t user_square(mp_obj_t arg) {
       // the function takes a single dense ndarray, and calculates the 
       // element-wise square of its entries
       
       // raise a TypeError exception, if the input is not an ndarray
       if(!mp_obj_is_type(arg, &ulab_ndarray_type)) {
           mp_raise_TypeError(translate("input must be an ndarray"));
       }
       ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);
       
       // make sure that the input is a dense array
       if(!ndarray_is_dense(ndarray)) {
           mp_raise_TypeError(translate("input must be a dense ndarray"));
       }
       
       // if the input is a dense array, create `results` with the same number of 
       // dimensions, shape, and dtype
       ndarray_obj_t *results = ndarray_new_dense_ndarray(ndarray->ndim, ndarray->shape, ndarray->dtype);
       
       // since in a dense array the iteration over the elements is trivial, we 
       // can cast the data arrays ndarray->array and results->array to the actual type
       if(ndarray->dtype == NDARRAY_UINT8) {
           uint8_t *array = (uint8_t *)ndarray->array;
           uint8_t *rarray = (uint8_t *)results->array;
           for(size_t i=0; i < ndarray->len; i++, array++) {
               *rarray++ = (*array) * (*array);
           }
       } else if(ndarray->dtype == NDARRAY_INT8) {
           int8_t *array = (int8_t *)ndarray->array;
           int8_t *rarray = (int8_t *)results->array;
           for(size_t i=0; i < ndarray->len; i++, array++) {
               *rarray++ = (*array) * (*array);
           }
       } else if(ndarray->dtype == NDARRAY_UINT16) {
           uint16_t *array = (uint16_t *)ndarray->array;
           uint16_t *rarray = (uint16_t *)results->array;
           for(size_t i=0; i < ndarray->len; i++, array++) {
               *rarray++ = (*array) * (*array);
           }
       } else if(ndarray->dtype == NDARRAY_INT16) {
           int16_t *array = (int16_t *)ndarray->array;
           int16_t *rarray = (int16_t *)results->array;
           for(size_t i=0; i < ndarray->len; i++, array++) {
               *rarray++ = (*array) * (*array);
           }
       } else { // if we end up here, the dtype is NDARRAY_FLOAT
           mp_float_t *array = (mp_float_t *)ndarray->array;
           mp_float_t *rarray = (mp_float_t *)results->array;
           for(size_t i=0; i < ndarray->len; i++, array++) {
               *rarray++ = (*array) * (*array);
           }        
       }
       // at the end, return a micropython object
       return MP_OBJ_FROM_PTR(results);
   }

To summarise, the steps for *implementing* a function are

1. If necessary, inspect the type of the input object, which is always of
   type ``mp_obj_t``
2. If the input is an ``ndarray_obj_t``, acquire a pointer to it by
   calling ``ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);``
3. Create a new array, or modify the existing one; get a pointer to the
   data by calling ``uint8_t *array = (uint8_t *)ndarray->array;``, or
   something equivalent
4. Once the new data have been calculated, return a ``micropython``
   object by calling ``MP_OBJ_FROM_PTR(...)``.

The listing above contains the implementation of the function, but as
such, it cannot be called from ``python``: it still has to be bound to
the name space. We do this by first defining a function object in

.. code:: c

   MP_DEFINE_CONST_FUN_OBJ_1(user_square_obj, user_square);

``micropython`` defines a number of ``MP_DEFINE_CONST_FUN_OBJ_N`` macros
in
`obj.h <https://github.com/micropython/micropython/blob/master/py/obj.h>`__.
``N`` is always the number of arguments the function takes. We had a
function definition ``static mp_obj_t user_square(mp_obj_t arg)``, i.e.,
we dealt with a single argument.
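
For reference, a hypothetical two-argument function (the names
``user_add`` and ``user_add_obj`` are made up) would be wrapped with the
``_2`` variant of the macro:

.. code:: c

   // hypothetical example: a function object for a two-argument function
   MP_DEFINE_CONST_FUN_OBJ_2(user_add_obj, user_add);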

Finally, we have to bind this function object in the globals table of
the ``user`` module:

.. code:: c

   STATIC const mp_rom_map_elem_t ulab_user_globals_table[] = {
       { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_user) },
       { MP_OBJ_NEW_QSTR(MP_QSTR_square), (mp_obj_t)&user_square_obj },
   };
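
For completeness, the globals table is then wrapped into a module object
in the usual ``micropython`` way; a sketch of the remaining boilerplate
(the dictionary name below is illustrative, consult the actual
`user.c <https://github.com/v923z/micropython-ulab/blob/master/code/user/user.c>`__
for the exact form) is

.. code:: c

   // sketch: wrap the globals table into a dictionary, and the dictionary into the module object
   STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_user_globals, ulab_user_globals_table);

   mp_obj_module_t ulab_user_module = {
       .base = { &mp_type_module },
       .globals = (mp_obj_dict_t *)&mp_module_ulab_user_globals,
   };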

Thus, the three steps required for the definition of a user-defined
function are

1. The low-level implementation of the function itself
2. The definition of a function object by calling
   ``MP_DEFINE_CONST_FUN_OBJ_N()``
3. Binding this function object to the namespace in the
   ``ulab_user_globals_table[]``