Array-series

Array-series have not been described in this blog until now. The concept arose as a consequence of database integration on the one side, and GAMS integration on the other. When interfacing with a particular Danish online database (Statistikbanken), the question of dimensionality came up. The database consists of a number of tables that are SQL-like in nature, that is, with columns describing dimensions or other attributes, and rows that represent “records”, or observations. For instance, a population table may consist of the columns sex, age, and year, with each row representing one record/observation, for instance that there were 35082 Danish women of age 40 in the year 1990. To store such information, one could construct a number of timeseries containing these numbers, for instance pop_w_40 representing the 40-year-old Danish women, with the value pop_w_40[1990] = 35082.

Naming series like that could be called name-composition, since parts of the names represent the dimensions. This can work OK, but there is the question of whether to use a delimiter such as ‘_’ in the series names, or a more terse name like popw40. The terse variant has a problem with name collisions. Consider for instance an input-output matrix, where anqz could represent deliveries from a sector ‘n’ to a sector ‘qz’. This is fine as long as there are not also sectors ‘nq’ and ‘z’ with a delivery between them, because that delivery would be named anqz, too. With underscores, the names would differ: a_n_qz versus a_nq_z, but such names are not exactly pleasing to the eye.
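The collision is easy to reproduce. The following is a minimal Python sketch, not Gekko code; the helper compose() is hypothetical and only glues a prefix and the dimension elements together, with or without a delimiter.

```python
def compose(prefix, *parts, delimiter=""):
    """Build a series name from a prefix and dimension elements (hypothetical helper)."""
    return delimiter.join((prefix, *parts))

# Without a delimiter, two different deliveries end up with the same name.
terse_1 = compose("a", "n", "qz")    # delivery from sector 'n' to sector 'qz'
terse_2 = compose("a", "nq", "z")    # delivery from sector 'nq' to sector 'z'
print(terse_1, terse_2, terse_1 == terse_2)

# With an underscore delimiter, the names stay distinct.
long_1 = compose("a", "n", "qz", delimiter="_")
long_2 = compose("a", "nq", "z", delimiter="_")
print(long_1, long_2, long_1 == long_2)
```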

Instead of this, Gekko 2.4 and especially 3.0 propose an array-series alternative, using brackets, where a_n_qz would become a[‘n’, ‘qz’], and a_nq_z would become a[‘nq’, ‘z’], and in the population case, pop_w_40 would become pop[‘w’, ‘40’]. The array-series indexes are strings, but these strings may also be written in ‘naked’ form, without the quotes: a[n, qz], a[nq, z] and pop[w, 40]. This is possible because it would not make sense to use a timeseries as an index into an array-series, so a naked name inside the brackets cannot be mistaken for one.

Using for instance pop[w, 40] instead of pop_w_40 or popw40 is perhaps more readable, especially if there are many dimensions. But array-series have other advantages, too. If pop is defined as an array-series, it will function as a container/storage for all the sub-array-series, bundling them together inside one object. So pop[w, 40] is an integrated part of the array-series pop, whereas pop_w_40 is just a stand-alone timeseries that does not belong logically to anything (other than residing in a databank). Therefore, an array-series like pop may be thought of as a kind of complete table, with each sub-array-series representing a row in that table. This makes it possible to perform operations on the whole table/array-series as one object, for instance copying, deleting, or printing it.

In addition to this, array-series allow optional domain restrictions on the dimensions, so that, for instance, setting pop[f, 40] would issue an error, since the domain corresponding to the first dimension does not contain the element ‘f’ (women are called ‘w’). This counteracts misspellings etc., and restricts the dimension elements, so that they cannot just take on any string value. All the information regarding an array-series (number of elements, dimensions, domains, etc.) can be shown with the DISP command, in contrast to name-composition, where such meta-information is only implicit.
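The idea of a container with domain-checked indexes can be sketched in a few lines of Python. This is only an analogy and says nothing about how Gekko implements it: the class ArraySeries, its attributes, and the age domain 0-99 are all assumptions made for the illustration, and a timeseries is simply modelled as a dict from year to value.

```python
class ArraySeries:
    """Hypothetical stand-in for an array-series: a mapping from index
    tuples to ordinary timeseries (here: dicts of year -> value)."""

    def __init__(self, name, domains):
        self.name = name
        self.domains = domains   # one set of allowed elements per dimension
        self.sub = {}            # the sub-series that have actually been set

    def _check(self, key):
        if len(key) != len(self.domains):
            raise KeyError(f"{self.name} expects {len(self.domains)} indexes")
        for position, (element, domain) in enumerate(zip(key, self.domains), start=1):
            if element not in domain:
                raise KeyError(f"'{element}' is not in the domain of dimension {position}")

    def __setitem__(self, key, series):
        self._check(key)
        self.sub[key] = series

    def __getitem__(self, key):
        self._check(key)
        return self.sub[key]


sexes = {"m", "w"}                        # assumed domain for the first dimension
ages = {str(a) for a in range(0, 100)}    # assumed domain for the second dimension
pop = ArraySeries("pop", [sexes, ages])
pop["w", "40"] = {1990: 35082}            # the figure quoted in the text

try:
    pop["f", "40"] = {1990: 0}            # 'f' is not in the sex domain
    rejected = False
except KeyError as error:
    rejected = True
    print("rejected:", error)
```

Because all sub-series live inside one object, the meta-information (domains, number of stored elements) is available in one place, which is roughly what DISP reports for a real array-series.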

Array-series support easy summing up of elements, for instance sum(#a, pop[w, #a]), where #a is a list of strings representing ages in the pop array-series. If #a represents the age domain, restrictions (filtering) can be set with ‘$’, for instance sum(#a, pop[w, #a] $ (#a in #a4049)), where #a4049 is a subset of #a, for instance the age group 40-49. This is very similar to GAMS syntax, where the dimensions of variables and parameters are typically assigned to sets. The following syntax is also legal in Gekko, and looks even more like GAMS: sum(#a, pop[w, #a] $ (#a4049[#a])), but the $ (… in …) syntax is probably easier to read. The keyword ‘in’ is used in the same way in Python and other languages, for instance (in Python): “if x in y:” or “for x in y:”, where the first one checks if x is a member of y, and the second one loops x over the elements of y.
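As a rough Python counterpart of the two filter notations, the sketch below sums a single year for women. It is an analogy, not Gekko code: the dict pop_w_1990 stands in for pop[w, #a] in 1990, and all numbers except 35082 are made up for the illustration.

```python
# Made-up numbers for illustration, except 35082, which is the figure quoted in the text.
pop_w_1990 = {"39": 100, "40": 35082, "41": 200, "50": 300}

a = ["39", "40", "41", "50"]   # plays the role of the list #a
a4049 = ["40", "41"]           # plays the role of #a4049, a subset of #a

# Counterpart of sum(#a, pop[w, #a] $ (#a in #a4049)):
# a membership test decides whether a term counts or contributes 0.
total_in = sum(pop_w_1990[age] if age in a4049 else 0 for age in a)

# Counterpart of sum(#a, pop[w, #a] $ (#a4049[#a])):
# the subset is used as a 0/1 indicator over the full list.
indicator = {age: int(age in a4049) for age in a}
total_indicator = sum(pop_w_1990[age] * indicator[age] for age in a)

print(total_in, total_indicator)
```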

Gekko also supports printing of a whole array-series, for instance “PRT pop;”. This will print all the elements of the array-series, instead of having to state them explicitly, as in “PRT pop[#s, #a];”. Simple mathematical operations on array-series are also possible, for instance “PRT pop*income;” where pop and income are two compatible array-series with the same dimensions. This will be developed further, providing a kind of algebra for array-series.
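What an element-by-element operation like pop*income amounts to can be sketched in Python. This is an analogy only: the function multiply() is hypothetical, the numbers are made up, and the rule that both operands must contain exactly the same elements is an assumption, since the post does not spell out what “compatible” means in detail.

```python
# Made-up numbers; each sub-series is a dict of year -> value.
pop = {("w", "40"): {1990: 10, 1991: 20}, ("w", "41"): {1990: 30, 1991: 40}}
income = {("w", "40"): {1990: 2, 1991: 3}, ("w", "41"): {1990: 4, 1991: 5}}

def multiply(x, y):
    """Multiply two array-series element by element, period by period."""
    if x.keys() != y.keys():
        # Assumption: operands must hold exactly the same elements.
        raise ValueError("the array-series do not contain the same elements")
    return {
        key: {year: x[key][year] * y[key][year] for year in x[key]}
        for key in x
    }

product = multiply(pop, income)
for key, series in sorted(product.items()):
    print(key, series)
```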

It should be mentioned that Gekko’s array-series are sparse: it is not the potential dimensionality of the array-series that matters regarding memory use, but only the number of concrete sub-series stored inside the array-series. So a Gekko array-series can be thought of as a table where the columns are the dimensions (excluding the time dimension), and where each row represents a sub-array-series. These sub-array-series are just normal timeseries, with data in the time dimension. For instance:

  Sex        Age     Value
  w          40      Timeseries
  w          41      Timeseries
  w          42      Timeseries

So pop[w, 40] is a (sub-)timeseries, defined over some time period (for instance 1990-2019), and pop[w, 41] is another timeseries, defined over the same time period. In a software system like GAMS, the data would be represented like this:

  Sex        Age     Year     Value
  w          40      1990     35082
  w          40      1991     35222
  w          40      1992     33596

Above, the last column is the actual data. So in GAMS, pop(‘w’, ‘40’, ‘1990’) = 35082, whereas in Gekko, pop[w, 40] is a (sub-)series, where pop[w, 40][1990] = 35082. When summing up and printing, Gekko has some options to control how to handle elements (combinations) that do not exist. For instance, in 1990 there are no women of age 108 or more. The option “OPTION series array calc missing” controls this for calculations: by default, Gekko will abort with an error if an element is missing. Alternatively, the option can be set so that a missing element is treated as zero, and thus does not count when summing up. For printing, the option “OPTION series array print missing” can be used, so that missing elements are printed as missing values, printed as zeroes, or simply skipped.
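The two layouts, and the two ways of treating a missing element when summing, can be illustrated with a small Python sketch. It is an analogy and not Gekko or GAMS code: the function total() and its parameter values 'error' and 'zero' are hypothetical names chosen for the illustration, not the actual option values, and the age-41 series is made up.

```python
# Gekko-style layout: one timeseries per (sex, age) combination.
# 35082, 35222 and 33596 are the figures from the table; the age-41 series is made up.
nested = {
    ("w", "40"): {1990: 35082, 1991: 35222, 1992: 33596},
    ("w", "41"): {1990: 100, 1991: 110, 1992: 120},
}

# GAMS-style layout: time is just another dimension, one record per data cell.
long_records = {
    (sex, age, str(year)): value
    for (sex, age), series in nested.items()
    for year, value in series.items()
}

def total(ages, year, missing="error"):
    """Sum pop[w, age] over ages; `missing` is 'error' or 'zero' (hypothetical names)."""
    result = 0
    for age in ages:
        series = nested.get(("w", age))
        if series is None:
            if missing == "error":
                raise KeyError(f"pop[w, {age}] does not exist")
            continue  # 'zero': the missing element adds nothing to the sum
        result += series[year]
    return result

# Age 108 is not stored, so the default aborts, while 'zero' ignores it.
try:
    total(["40", "41", "108"], 1990)
    aborted = False
except KeyError:
    aborted = True

lenient = total(["40", "41", "108"], 1990, missing="zero")
print(aborted, lenient)
```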

A table like the one shown above where the values (data cells) are additive can also be understood as (part of) a multidimensional OLAP-cube, cf. the illustration accompanying the blog post. At some point, dataframes (known from R (dplyr), Python (pandas) etc.) will also be introduced in Gekko. Dataframes are much like SQL tables, and they would ease the way data could be exchanged between Gekko and R, Python, etc.

When designing the array-series syntax, the AMPL syntax was also considered, but the GAMS syntax is actually quite compact (on the downside, it is less similar to mathematical notation than AMPL). For summing up with restrictions ($-operator), GAMS supports both the syntax corresponding to sum(#a, pop[w, #a] $ (#a in #a4049)) and the alternative sum(#a $ (#a in #a4049), pop[w, #a]) (both variants are shown here in Gekko syntax). In the first case, the loop runs over all elements in #a, with the filtered-out terms counting as 0, whereas in the second case, the loop itself only runs over a subset of #a, thus skipping some of the elements. The latter variant is not implemented in Gekko yet. The result would be the same, but the latter variant would run faster.
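The difference between filtering the term and filtering the loop can be made visible by counting how many times the loop body runs. The Python sketch below is an analogy only, with made-up numbers apart from 35082; it does not reflect how Gekko or GAMS evaluate such sums internally.

```python
# Made-up numbers for illustration, except 35082, which is the figure quoted in the text.
pop_w_1990 = {"39": 100, "40": 35082, "41": 200, "50": 300}
a = list(pop_w_1990)       # plays the role of #a
a4049 = {"40", "41"}       # plays the role of #a4049

# Variant 1: filter on the term. Every age is visited; excluded ones count as 0.
visits_1 = 0
total_1 = 0
for age in a:
    visits_1 += 1
    total_1 += pop_w_1990[age] if age in a4049 else 0

# Variant 2: filter on the loop. The body only runs for ages in the subset.
visits_2 = 0
total_2 = 0
for age in (x for x in a if x in a4049):
    visits_2 += 1
    total_2 += pop_w_1990[age]

print(total_1, visits_1, total_2, visits_2)
```

The totals agree, but the second variant evaluates the summand fewer times, which matters when the summand is expensive or the subset is small.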

All in all, array-series are very convenient, both for data-manipulation (if data is multi-dimensional), for interfacing with databases or other software packages, and last but not least: for modeling. Gekko does not yet implement array-series in the modeling part of the software, but this will come sooner or later (hopefully sooner: until then, Gekko interfaces tightly with GAMS, and Gekko reads and writes gdx files and can also understand GAMS equations).

Note: a database or GAMS gdx file may contain data that is not in the time-domain, for instance parameters, constants, etc. Gekko will represent such data as an array-series where the sub-series are timeless. For convenience, it is possible to define a timeless timeseries in Gekko (such a series essentially behaves like a value (val scalar)).
