Parquet datafiles

<< Click to Display Table of Contents >>

Navigation:  Appendix >

Parquet datafiles

Previous pageReturn to chapter overviewNext page

Gekko supports import and export of Apache Parquet files, storing dataframes in a so-called columnar representation, which supports metadata and is much more strict and efficient than for instance .csv. Think of the Gekko parquet representation as a dataframe/table that looks like this:

 

clip0207

 

The parquet file has two parts: rowgroup 0 (upper) and rowgroup 1 (lower). Rowgroup 0 contains data for all columns except date, period and value. Rowgroup 1 only contains data for columns id, date, period and value. Think of rowgroup 0 as metadata and other info (one row corresponds to 1 normal series or array-subseries). And think of rowgroup 1 as the raw data observations (period and value) for each series or array-series. Because rowgroup 1 is unfolded into the time dimension, rowgroup 1 is typically much larger than rowgroup 0.

 

Above, the first 4 rows are rowgroup 0, and the rest are rowgroup 1. The data contains a normal quarterly timeseries x0!q, a daily array-timeseries x1!d[a], and two weekly array-timeseries x1!w[a] and x1!w[b]. In rowgroup 1 it is seen that x0!q has 3 observations, x1!d[a] has 3 observations, x1!w[a] is timeless, and x1!w[b] has 3 observations.

 

To use the parquet file in for instance Python (or R or SQL etc.), the lower part of the table (rowgroup 1) is used. Before use of rowgroup 1, it is advised to merge all the rowgroup 0 metadata from the top-left part of the table into the bottom-left part of the table, using id as the key. See the Python merge example later on. (Storing all that metadata inside rowgroup 1 would create redundancies).

 

Column explanation:

 

id. String. For instance b:x!q[i,j].

bank. String. For instance b.

name. String. For instance x.

freq. String. For instance q.

dims. Integer. For instance 2.

dim1. String. For instance i.

dim2. String. For instance j. Depending on dims, there may be dim3, dim4, etc.

label. String. For instance Quarterly production in country i, sector j.

source. String. For instance OECD Economic Outlook.

unit. String. For instance Million USD.

is_timeless. Boolean. For instance false.

date_start. Unix time. For instance 2000-01-01T00:00:00.

date_end. Unix time. For instance 2025-10-01T00:00:00.

period_start. String. For instance 2000q1

period_end. String. For instance 2025q4.

stamp. Unix time. For instance 2025-07-01T00:00:00.

date. Unix time. For instance 2025-07-01T00:00:00.

period. String. For instance 2025q3.

value. Floating point value. For instance 97456.552.

 

 

Parquet metadata

 

The parquet file also contains the following metadata (as a string-string dictionary):

 

software.name = Gekko Timeseries and Modeling Software

software.version = 3.3.2 (for instance);

table.label = ...the databank heading...

parquet.design.version = 1.0 (for instance)

parquet.design.url = https://t-t.dk/gekko/docs/user-manual/index.html?appendix_parquet.htm

export.timestamp = 2025-12-09T08:55:55.8277140Z (for instance, format is ISO-8601).

column.id.comment = An id corresponding to the Gekko name, for merging rowgroup1 into rowgroup2. Only lower-case, no blanks. String.

column.bank.comment = Gekko databank name, often same as file name without extension (for future use, to store several databanks in 1 parquet file). String.

column.name.comment = The Gekko series name. Alphanumeric or underscore chars, lower or upper-case. String.

column.freq.comment = The Gekko frequency: a (annual), q (quarterly), m (monthly), w (weekly), d (daily), u (undated). Lower-case. String.

column.dims.comment = Number of dimensions of the given (array-) timeseries. Integer.

column.dim1 = Dimension 1. String.

column.dim2 = Dimension 2. String. Etc.

column.label.comment = The label of the given timeseries. String.

column.source.comment = The source of the given timeseries. String.

column.unit.comment = The unit of the given timeseries. String.

column.is_timeless.comment = True if the timeseries is constant for all periods. Boolean.

column.date_start.comment = The date corresponding to the first value of the timeseries in the Gekko databank. Date format, Unix time. Quarters etc. are identified as their *first* day. Not used by Gekko when importing.

column.date_end.comment = The date corresponding to the last value of the timeseries in the Gekko databank. Date format, Unix time. Quarters etc. are identified as their *first* day. Not used by Gekko when importing.

column.period_start.comment = The period corresponding to the first value of the timeseries in the Gekko databank. String format. Not used by Gekko when importing.

column.period_end.comment = The period corresponding to the last value of the timeseries in the Gekko databank. String format. Not used by Gekko when importing.

column.stamp.comment = The timestamp corresponding to the last time the timeseries was changed in the Gekko databank. Date format, Unix time.

column.date.comment = The date corresponding to the current data value. Date format, Unix time. Quarters etc. are identified as their *first* day. Not used by Gekko when importing.

column.period.comment = The period corresponding to the current data value. String.

column.value.comment = The data value. Numeric floating-point.

 

 

Python example

 

The following code shows how to transform a Gekko-produced parquet file (export<parquet> or write<parquet>) into a suitable Pandas dataframe (df), where the metadata is merged in.

 

import pyarrow.parquet as pq
import pandas as pd
 
parquet_file = pq.ParquetFile("...insert filename...")
 
df0 = parquet_file.read_row_group(0).to_pandas().drop(columns=['date', 'period', 'value'])
df1 = parquet_file.read_row_group(1).to_pandas()[['id', 'date', 'period', 'value']]
df = df0.merge(df1[['id']], on='id', how='right')
df[['date', 'period', 'value']] = df1[['date', 'period', 'value']].values
print(df)

 
print()
print("Gekko parquet version:  " + parquet_file.metadata.metadata.get(b"parquet.design.version").decode("utf-8"))
print("Gekko export timestamp: " + parquet_file.metadata.metadata.get(b"export.timestamp").decode("utf-8"))

 

 

Notes

 

Apache Parquet has a columnar representation similar to Apache Arrow. Think of parquet as a representation for efficient long-term storage, and arrow as a representation for fast in-memory processing. Gekko supports exporting (but not importing) arrow files, but parquet is advised instead of arrow, since parquet is much more developed in Gekko.

 

When exporting, Gekko only produces period formats like 2020q2, 2020m1d2, 2020w8 (Gekko style), not for instance 2020K2, 2020M01D02, 2020U08 or other formats (such formats may be implemented). Gekko uses period when importing, not date.

 

Details:

 

The bank column is really only there to be able to store more than 1 Gekko databank in the same parquet file, without risking name-collisions. The bank column values can be set with export<bankname=...>, and you can filter importing with import<bankname=...>. If bankname= is not provided when exporting, the parquet filename without extension is used. When importing, you do not need to state bankname=, if there is only 1 bankname present. At the moment, Gekko cannot export with > 1 bankname.

When importing, date_start, date_end, period_start, period_end, and date are not used at all.

A completely empty array-series (that is, without data) will not be exported to parquet.

If you are "producing" or manipulating a dataframe in for instance Python, R or SQL for later transfer to Gekko, please organize the rowgroup 1 dataframe so that the data observations/rows are assembled together for each series or array-subseries. This will make the data load much faster (so do not produce consecutive rows with x1 in 2020, x2 in 2020, x1 in 2021, x2 in 2021, but rather x1 in 2020, x1 in 2021, x2 in 2020, x2 in 2021).

id is strictly given from bank, name, freq, dim1, ... , dimN, when adding colon (:), frequency indicator (!), brackets ([ and ]) and commas (,). Everything in id is lower-case and without blanks. When importing, Gekko uses bank, name, dim1, ... , dimN from rowgroup 0 to preserve casing (freq is kept lower-case).

date_start corresponds to period_start, date_end corresponds to period_end, and date corresponds to period. Of these 6, when importing, Gekko only uses period.

Provided a suitable frequency (a/q/m/w/d), Gekko will import period like for instance 2020K2, 2020M01D02, 2020U08 correctly for Danish users, but cannot write using such formats. (When importing, the logic is to look for q/Q/k/K for quarters, m/M for months, w/W/u/U for weeks, d/D for days, and parse integers so that trailing zeroes are accepted).