<< Click to Display Table of Contents >> IMPORT |
The IMPORT statement merges data (typically series) from an external file into the first-position databank. The IMPORT statement is primarily for non-.gbk files, and it should be noted that IMPORT without options restricts data to the global time period, it only puts data into the first-position databank, and it merges data with any pre-existing data in the first-position databank.
Import supports collapsing (aggregating) data points of high frequency into annual, quartery, monthly, weekly or daily series, cf. the <collapse> option.
Compatibility note: If a time period is not indicated in the <>-option field, Gekko 3.0 will only import data inside the global time period. Before Gekko 3.0, all data would have been imported. To emulate previous behavior, you can use IMPORT<all>. Alternatively, you may set "OPTION bugfix import export = yes;". If the option is set, IMPORT and EXPORT will work as in pre-3.0 versions. The option will be removed at some point, so it is better to change occurrences of date-less IMPORT to IMPORT<all> in old Gekko program files. |
Tabular formats note: When using IMPORT with xlsx, csv or prn files, it is advised to first set the global frequency (option freq = ...) to the frequency of the data file (temporary frequency change can be done with block freq ...; import ... ; end;). With DATEFORMAT and DATETYPE at their default values, dates like 1990a1, 1990y or 90 are treated as annual 1990. A date like 199003 is treated as 1990q3 or 1990m3, if global frequency is set to q or m. A date like 19900325 is always understood as daily 1990m3d25. To import data with undated (u) frequency, the global frequency should be set to u first. |
IMPORT is intended for non-.gbk files, and can be thought of as a soft version of READ. In contrast to READ, IMPORT does not clear the first-position databank (instead it merges data), it only imports data corresponding to the global time period (unless a time period or <all> is used), and it does not alter the Ref databank. There are the following equivalences: IMPORT = READ<first merge respect>, and the inverse: READ = CLEAR<first> + IMPORT<all> + CLONE.
Because Gekko uses a so-called cache for databank files, the same file will load faster the next time it is imported.
import < period format ALL COLS REF SHEET=... CELL=... NAMECELL=... DATECELL=... COLLAPSE=... METHOD=... DATEFORMAT=... DATETYPE=... ARRAY VARIABLECODE > filename TO bankname;
period |
(Optional). Without a time period indicated, Gekko will import all the data for all observations. When a period is indicated, the databank is time-truncated. |
format |
(Optional). Choose between AREMOS, CSV, FLAT, GDX, PCIM, PRN, PX, TSD, TSP, XLS, XLSX.
•AREMOS: If you need more precision (significant digits) from AREMOS than the TSD format can provide, you may use AREMOS dump files. See example. •CSV: Comma-separated file. Tabular format with rows/cols consisting of names and dates. The global frequency (OPTION freq) must correspond to the frequency in the file. •FLAT: A Gekko-specific text-based format with lines that resemble Gekko series statements. The format is: variable name + start date + end date + numbers. These items are separated by blanks, for instance "x 2020 2022 1.5 -2.5 3.5". This corresponds to x <2020 2022> = 1.5, -2.5, 3.5;. If only one number is given, it will be used for the full time period. You can use 'm' or 'm()' to indicate a missing value. Blank lines and lines beginning with '//' are ignored. This format reads much faster than 'real' series statements (which have to be parsed and compiled before the values are extracted). •GCM: Gekko can export timeseries as a .gcm file, using export<gcm>. To "import" such a .gcm file, just RUN it (or copy-paste the contents into your preferred .gcm file). Note that .flat files are similar, but load much faster. •GDX: A binary GAMS-database. Note option gams exe folder = ... where it is possible to point to the exact GAMS folder (otherwise the system will try to auto-locate GAMS). Even without the option set, the auto-locator seems pretty good at locating a 32-bit GAMS for a 32-bit Gekko, and a 64-bit GAMS for a 64-bit Gekko version. Please note that the data is read as array-timeseries (see SERIES), and that Gekko only reads variables, parameters, sets (as Gekko lists) and domains. GAMS can be freely downloaded as a demo, and the demo will work fine regarding Gekko IMPORT. Default options are option gams time set = 't'; option gams time prefix = ''; option gams time offset = 0; option gams time detect auto = no;. This corresponds to time having the set name 't', with natural values, for instance 2020, 2021, etc. Default GAMS read is using a fast reader (low-level API). If this poses problems, try the more robust normal API by setting option gams fast = no;. See more under OPTION. •PCIM: A binary PCIM databank. •PRN: The first item in the prn format must be either 'date' or 'name' to indicate the orientation. The global frequency (option freq = ...) must correspond to the frequency in the file. •PX: Imports a PC-Axis file. See also the <array> option. See more info regarding the px format and how Gekko reads it under the DOWNLOAD statement. Note that the time element of the .px file (CODES("tid") or CODES("time")) should preferably be stated last (this is to keep the .px data file organised in the time dimension, which makes it load faster in Gekko). See also the array and variablecode options: specifically it is advised to use the variablecode option in order to get shorter timeseries names in the non-array case (these will correspond to what you get with direct DOWNLOAD). •SDF: Imports 'sdf' format. The format is a tab-separated 'table' without header row, with period in the second-last column, and value (with decimal comma) in the last column. •TSD: For interchange with AREMOS and others. •TSP: Imports TSP records. •XLS and XLSX: Tabular format with rows/cols consisting of names and dates. See SHEET<import> to see how to load a spreadsheet as a nested list of "cells" and scrape data from these. The global frequency (option freq = ...) must correspond to the frequency in the file. The engine used for Excel reading can be changed with option sheet engine = ...;.
|
filename |
Filenames may contain an absolute path like c:\projects\gekko\bank.gbk, or a relative path \gekko\bank.gbk. Filenames containing blanks and special characters should be put inside quotes. Regarding reading of files, files in libraries can be referred to with colon (for instance lib1:bank.gbk), and "zip paths" are allowed too (for instance c:\projects\data.zip\bank.gbk). See more on filenames here. If the filename is set to '*', you will be asked to choose the file in Windows Explorer. The extension .gbk is automatically added, if it is missing. |
ALL |
(Optional). With this option, all observations are imported, regardless of the global time period. This corresponds to pre-3.0 Gekko behavior. |
REF |
(Optional). Reads the file into the reference databank (shown as REF on the F2 window list). Note that the Ref/reference databanks does not show up in the F2 window if it is empty. |
COLS |
(Optional). For .csv or Excel files, this indicates whether the timeseries are running downwards in columns. Note that for .prn files, you indicate this in the first 'cell' (date/name). |
TO |
(Optional). If "TO bankname" is indicated, Gekko will put the data into a seperate 'named' databank alongside the Work and Ref databanks. For instance, after import <xlsx> adambk TO a;, you may refer to the variables by means of colon, for instance prt a:var1;. If you use import <xlsx> adambk TO *;, the bankname will be the same as the file name. It should be noted that the databank will be read-only (non-editable) when opened like this (this functionality is a subset of the OPEN statement) |
ARRAY |
(Optional). Regarding the PX format, if this option is set, Gekko will put the data into array-timeseries rather than normal timeseries (for the GDX format, Gekko always puts into array-timeseries per default). The variablecode option will not matter if array option is used. |
VARIABLECODE |
(Optional). Regarding the PX format, if this option is set, Gekko will use shorter variable names codes, for instance VAREGR instead of varegruppe. Using this option is recommended, and the DOWNLOAD statement uses it implicitly. The variablecode option will not matter if array option is used. |
SHEET= |
(Optional). The name of the sheet for your data, for instance 'Data1' |
CELL= |
(Optional). For Excel files: the first cell of the data section. Defaults to 'B2'. This allows for rows/columns between the dates, names and data cells (cf. DATECELL and NAMECELL). |
DATECELL= |
(Optional). For Excel files: the first cell of the dates labels. Calculated from CELL location if not provided. This allows for rows/columns between the dates, names and data cells (cf. CELL and NAMECELL). |
NAMECELL= |
(Optional). For Excel files: the first cell of the names labels. Calculated from CELL location if not provided. This allows for rows/columns between the dates, names and data cells (cf. CELL and DATECELL). |
COLLAPSE= |
(Optional). For Excel files with Excel-dates that are going to be collapsed, this option can be set to either a, q, m, w or d, and indicates the frequency that the data points are being collapsed into. A data point is an Excel date and a corresponding value, for instance 24-Dec-2010 with the value 123.45. In your Excel version, this date might be shown as 24/12/2010 (British English) or 12/24/2010 (US English) or in other formats, but internally there is no confusion, since the Excel dates are stored as values (technically the number of days since January 1, 1900). See the collapse example below. |
METHOD= |
(Optional: default = total). For Excel files using COLLAPSE, the METHOD option sets how the collapse (aggregation) is performed. Choose between total, avg, first, last, count, cf. also the COLLAPSE statement. Use "method=count" to check that data is being collapsed as desired, and note that "method=avg" amounts to "method=total" divided by "method=count". After collapsing into monthly or quarterly timeseries, X12A may be used for seasonal adjustment. |
DATEFORMAT= DATETYPE= |
(Optional). These options control the date format for .xlsx and .csv files, and for the use of SHEET in Gekcel. DATEFORMAT can be either 'gekko' (default, for instance 2020q3 or 2020m11) or a format string like 'yyyy-mm-dd', and the latter may contain a first or last indicator, for instance 'yyyy-mm-dd last', which indicates for quarterly, monthly or weekly data that the last day of the quarter/month/week is used. DATETYPE can be either 'text' or 'excel'. In the former case, the dates are understood as text strings (for instance '2020q3' or '2020-09-30' for a quarterly date), and in the latter case (not relevant for .csv files), the date is understood as an Excel date, which basically counts the days since January 1, 1900. This number would correspond to 44104 for the date 2020-09-31, and can be shown in Excel in different ways depending upon date format settings, language settings, etc., but the internal number itself is unambiguous. When using SHEET in Gekcel, using DATETYPE='excel' is advised for consistency. [New in 3.0.5]. |
•If no period is given inside the <...> angle brackets, the global period is used (cf. TIME).
Reading data from the file data.xlsx (spreadsheet) can be done with:
import <xlsx> data; |
or by the following:
import <xlsx> *; |
and then selecting the file. You can use paths etc.:
import <tsd> otherbanks\adam3; |
This will look for adam3.tsd in the subfolder 'otherbanks', relative the the Gekko working folder.
Use the TO keyword like this:
import <xlsx> forecst2 TO f2; |
This reads forecst2.xlsx into the named databank f2. After this, you may use for instance prt f2:gdp; to print out the timeseries gdp from this databank. You may use import <xlsx> forecst2 TO *; if you wish to use the filename as databank name. It is possible to use for instance import <xlsx> * TO *;.
Using IMPORT for csv or Excel files is only implemented for 'well-formed' spreadsheets. That is, with data starting in the first column and first row, and with either timeseries running left-to-right (normal for .csv files) or downwards (less normal). You may use import<csv cols> or import<xlsx cols>, if the timeseries are running downwards. If you need to pick out data from Excel cells more arbitrarily, see the SHEET<import> statement.
Excel data may be collapsed from higher frequencies than daily (for instance hourly observations), if the dates are represented as 'Date' types in Excel. Example:
import <xlsx sheet='oil' collapse=m> highfreq.xlsx; |
The Excel sheet might look like this:
|
Using this, it is expected that the timeseries run row-wise, with data starting at cell B2. Hence, the first date should be found at B1, and the dates should continue at cells B2, B3, etc. The first name should be at cell A2, and the names should continue at cells A3, A4, etc. In the example, the timeseries will be collapsed into monthly frequency, in the following way. For each date in the dates row, the month of this particular date is found, and the data is put into that particular month for each timeseries. Since the default method is 'total', the data is being summed for each month. You may use "collapse = q" or "collapse = a" to collapse directly into quarterly or annual data (this is better and simpler than using the COLLAPSE statement on the resulting monthly timeseries). For daily data, you may alternatively read these directly with normal IMPORT, and later use COLLAPSE to collapse them into monthly timeseries.
If needed, the X12A statement can then be used for seasonal adjustments. If you only want to obtain parts of the timeseries, you may restrict with a time period, for instance:
import <2000m1 2018m5 xlsx sheet='oil' collapse = m> highfreq.xlsx; |
If you prefer averages, use "method=avg":
import <xlsx sheet='oil' collapse=m method=avg> highfreq.xlsx; |
To check that there is a similar number of data points for each month, you may use "method=count" and print/plot the resulting series to check this (particularly relevant regarding the start and end of the range of Excel dates). As noted above, "method=avg" amounts to "method=total" divided by "method=count".
The data does not need to start at cell B2. If, for instance, the first data cell is at G10, you may use:
import <xlsx sheet='oil' cell='g10' collapse=m> highfreq.xlsx; |
Here, Gekko will expect the first date cell to be at G9, and the first name cell to be at F10. If there are rows/cols between the dates/names and the data cells, you may indicate the precise location of the dates/names:
import <xlsx sheet='oil' cell='g10' datecell='g1' namecell='a10' collapse=m> highfreq.xlsx; |
In this particular case, the dates/names are located in the first row and column of the spreadsheet. If the timeseries run downwards in columns, you may use <cols> for transposed importing:
import <xlsx cols sheet='oil' collapse=m> highfreq.xlsx; |
Note: when using this functionality, you may 'collapse' for instance monthly data into its own frequency. In that case, using <method=count> should produce timeseries with value 1, indicating that there is only 1 observation for each month (otherwise something is wrong regarding the Excel sheet). Note also that dates in Excel are represented as the number of days since January 1, 1900. These dates may contain fractions, so 1 hour is represented by 1/24, etc. Keep this in mind if you are using <datecell=...>. If this points to a sequence of numbers that are not dates, these numbers may be erroneously interpreted as dates!
You may use a setup like the following to transfer timeseries with high precision from AREMOS (the tsd format has rather limited precision).
!In AREMOS dump1 #banker; |
This will produce the three files bank1.dmp, bank2.dmp, and bank3.dmp.
//In Gekko |
This will produce the three files bank1.gbk, bank2.gbk, and bank3.gbk. Only timeseries can be transferred like this.
To convert a .tsd file or other formats into a .gbk file, just import it with IMPORT<tsd>, and WRITE it. Please note that a .tsd file operates with 8 significant digits (or less), so there will typically be a loss of precision compared to a .gbk file (which is in double-precision).
The option copylocal below copies the targeted file to a temporary file on the user's local hard disk before reading. This copying is typically very fast, and afterwards reading the temporary file is faster and more reliable, if the targeted file is located on a network drive. In general, this is a recommended option that alleviates some potential network problems.
The cache = nongbk option is recommended, too. It does the following: (a) it finds the data file, (b) it optionally copies it to the user's local hard disk, and (c) it calculates a MD5 hash code from the file. If this hash code is already known, Gekko uses a cache file representing the databank data, which in most cases is much faster to read (it is in protobuf format: the same format that is used for .gbk files). So there is a small speed penalty for calculating the hash, and a large speed gain for using the protobuf/cache file.
See the Gekko menu 'Options' --> 'Program dependency tracking' or use option global dependency tracking = ...; to activate dependency tracking, so that the use of external files (for instance program files, read/written databanks etc.) are shown as a list at the end of a Gekko session.
OPTION databank file cache = nongbk; [all|nongbk|none]
OPTION databank file copylocal = yes;
OPTION folder bank = [empty];
OPTION folder bank1 = [empty];
OPTION folder bank2 = [empty];
OPTION gams exe folder = [empty];
OPTION gams fast = yes;
OPTION gams time set = 't';
OPTION gams time prefix = '';
OPTION gams time offset = 0;
OPTION gams time detect auto = no;
OPTION gams trim = 0; //can ignore variables/parameters with few elements.
OPTION sheet engine = internal; //use 'excel' for the older .xls format
READ, WRITE, EXPORT, OPEN, CLONE, DOWNLOAD, COLLAPSE