Relational data models in R

Angel D'az, Kirill Müller


Relational databases are a very powerful tool for analyzing and manipulating data. However, many modeling workflows require a great deal of time and effort to wrangle data from databases into a flat data frame or table format. Only then can the actual data analysis start.



Why a relational model?

With the proper tools, analysis can begin with a relational model that works directly on the database, and, if wrangling is still required, R users can leverage the powerful and proven SQL approach to data organization and manipulation.

The dm package takes the primary advantage of databases – relational modeling – and brings it to R. In relational databases, tables or data frames are linked through primary keys (PK) and foreign keys (FK). Instead of having a single wide table to work with, the data is segmented across multiple tables in order to eliminate, or at least reduce, redundancies. This process is called normalization. A classic example is storing unique identifiers in a large table and keeping the lookup values for these identifiers in a smaller table. This lets an analysis run without reading or writing the lookup values until they are needed, because the unique identifiers suffice for most of the work.
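
As a minimal sketch of this idea with made-up data (the table and column names are purely illustrative, not part of dm):

library(dplyr)

# A flat, denormalized table: the carrier name is repeated on every row
flights <- tibble::tibble(
  flight_id    = 1:4,
  carrier      = c("UA", "UA", "DL", "DL"),
  carrier_name = c("United", "United", "Delta", "Delta")
)

# Normalized: each carrier name is stored once in a small lookup table;
# the large table keeps only the unique identifier
carriers <- flights %>% distinct(carrier, carrier_name)
flights  <- flights %>% select(-carrier_name)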

dm 0.1.1 is available on CRAN. You can install it with the following command:

install.packages("dm")

Connect to a database

We connect to a relational dataset repository, a database server that is publicly accessible without registration. It hosts a financial dataset that contains loan data along with related information and transactions. We chose this loan dataset because the relationships between the loans, accounts, and trans tables are a good representation of databases that record real-world business transactions.

The dataset page lists the credentials required for connecting to the database:

  • hostname: relational.fit.cvut.cz
  • port: 3306
  • username: guest
  • password: relational
  • database: Financial_ijs

These credentials can be used in, e.g., MySQL Workbench to download the data manually as CSV files. To automate the process and to keep the data on the database for as long as possible, we instead connect to the database from R through its DBI interface and access the tables:

library(RMariaDB)
my_db <- dbConnect(
  MariaDB(),
  user = 'guest',
  password = 'relational',
  dbname = 'Financial_ijs',
  host = 'relational.fit.cvut.cz'
)
dbListTables(my_db)

## [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"
## [7] "orders"    "tkeys"     "trans"

By creating a dm object from the connection, we get access to all tables:

library(dm)
my_dm <- dm_from_src(my_db)

my_dm

## ── Table source ────────────────────────────────────────────────────────────────
## src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
## ── Metadata ────────────────────────────────────────────────────────────────────
## Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
## Columns: 57
## Primary keys: 0
## Foreign keys: 0
names(my_dm)

## [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"
## [7] "orders"    "tkeys"     "trans"
my_dm$accounts

## # Source:   table<accounts> [?? x 4]
## # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
##      id district_id frequency        date
##   <int>       <int> <chr>            <date>
## 1     1          18 POPLATEK MESICNE 1995-03-24
## 2     2           1 POPLATEK MESICNE 1993-02-26
## 3     3           5 POPLATEK MESICNE 1997-07-07
## 4     4          12 POPLATEK MESICNE 1996-02-21
## 5     5          15 POPLATEK MESICNE 1997-05-30
## 6     6          51 POPLATEK MESICNE 1994-09-27
## # … with more rows

The components of this particular dm object are lazy tables powered by dbplyr, a package that translates the dplyr grammar of data manipulation into queries the database server understands. The advantage of a lazy table is that no data is downloaded until results are collected for printing or local processing. Below, the summary operation is computed on the database, and only the results are sent back to the R session.

library(dplyr)
my_dm$accounts %>%
  group_by(district_id) %>%
  summarize(n = n()) %>%
  ungroup()

## # Source:   lazy query [?? x 2]
## # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
##   district_id n
##         <int> <int64>
## 1           1 554
## 2           2  42
## 3           3  50
## 4           4  48
## 5           5  65
## 6           6  48
## # … with more rows
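
To verify that the work indeed happens on the server, we can append dplyr's show_query() to the same pipeline; it prints the SQL that dbplyr generates instead of executing the query (output omitted here):

my_dm$accounts %>%
  group_by(district_id) %>%
  summarize(n = n()) %>%
  show_query()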

If the data fits into your RAM, a database connection is not required to use dm. The collect() verb downloads all tables in our dm object.

my_local_dm <-
  my_dm %>%
  collect()

object.size(my_local_dm)
## 77922024 bytes
my_local_dm$accounts %>%
  group_by(district_id) %>%
  summarize(n = n()) %>%
  ungroup()

## # A tibble: 77 x 2
##   district_id     n
##         <int> <int>
## 1           1   554
## 2           2    42
## 3           3    50
## 4           4    48
## 5           5    65
## 6           6    48
## # … with 71 more rows

A dm object can also be created from individual data frames with the dm() function.
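
A minimal sketch with two made-up data frames (the names are illustrative):

customers <- data.frame(id = 1:3, name = c("Ann", "Ben", "Cleo"))
orders    <- data.frame(id = 1:5, customer_id = c(1, 1, 2, 3, 3))

# Table names are taken from the names of the objects;
# keys can be added just like for a database-backed dm
dm(customers, orders) %>%
  dm_add_pk(customers, id) %>%
  dm_add_fk(orders, customer_id, customers)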

Define primary and foreign keys

Primary and foreign keys are how relational database tables are linked with each other. The model diagram provided by our test database illustrates the intended relationships.

However, it turns out this is not an accurate representation of the entities and relationships within the database:

  • Table names in our database are plural; in the diagram they are singular.
  • There is a tkeys table available in the database that is not listed in the model diagram.
  • The Financial_std database is similar to, but different from, the one we work with, Financial_ijs.

Bearing these discrepancies in mind, we can define suitable primary and foreign keys for our dm object. The documentation suggests that the loans table is the most important one. We highlight the target table with dm_set_colors().

# Defining PKs and FKs
my_dm_keys <-
  my_local_dm %>%
  dm_add_pk(districts, id) %>%
  dm_add_pk(accounts, id) %>%
  dm_add_pk(clients, id) %>%
  dm_add_pk(loans, id) %>%
  dm_add_pk(orders, id) %>%
  dm_add_pk(trans, id) %>%
  dm_add_pk(disps, id) %>%
  dm_add_pk(cards, id) %>%
  dm_add_fk(loans, account_id, accounts) %>%
  dm_add_fk(orders, account_id, accounts) %>%
  dm_add_fk(trans, account_id, accounts) %>%
  dm_add_fk(disps, account_id, accounts) %>%
  dm_add_fk(disps, client_id, clients) %>%
  dm_add_fk(accounts, district_id, districts) %>%
  dm_add_fk(cards, disp_id, disps) %>%
  dm_set_colors(green = loans)

# Draw the visual model
my_dm_keys %>%
  dm_draw()

The discrepancies highlight the importance of being able to define primary and foreign keys yourself. Most struggles with manipulating data are not due to gaps in syntax knowledge; syntax can always be looked up with a search engine. Gaps in knowledge about how the data is organized are a much more common stumbling block for R users working with distributed data.

Insight into the structure of a database via the built-in dm_draw() function provides an instant efficiency boost. Combined with defining unique identifiers (primary keys) and how they are referenced by other tables (foreign keys), an R user can quickly clarify the structures they are working with.

To assist with this process of defining structure, dm comes with a built-in helper to check referential integrity of the dataset:

my_dm_keys %>%
  dm_examine_constraints()

## ℹ All constraints satisfied.

Create a dataset ready for analysis

For modeling, a flat table or matrix is required as input. If normalization is the process of splitting up a table to reduce redundancies, joining multiple tables back together is called denormalization.

The dm_squash_to_tbl() function creates a denormalized table by performing a cascading join starting at cards and following all outgoing foreign keys. A join is the SQL term for combining columns from two or more tables into a single table by matching rows on the appropriate keys. In this case, the cards table has a foreign key to the disps table, which has a foreign key to accounts, which in turn has a foreign key to the districts table. These foreign key relationships are used for the cascading join inside dm_squash_to_tbl(), without us having to specify them, because they are already encoded within the dm object.

my_dm_keys %>%
  dm_squash_to_tbl(cards)

## Renamed columns:
## * type -> cards$cards.type, disps$disps.type
## * district_id -> accounts$accounts.district_id, clients$clients.district_id

## # A tibble: 892 x 28
##      id disp_id cards.type issued     client_id account_id disps.type
##   <int>   <int> <chr>      <date>         <int>      <int> <chr>
## 1     1       9 gold       1998-10-16         9          7 OWNER
## 2     2      19 classic    1998-03-13        19         14 OWNER
## 3     3      41 gold       1995-09-03        41         33 OWNER
## 4     4      42 classic    1998-11-26        42         34 OWNER
## 5     5      51 junior     1995-04-24        51         43 OWNER
## 6     7      56 classic    1998-06-11        56         48 OWNER
## # … with 886 more rows, and 21 more variables: accounts.district_id <int>,
## #   frequency <chr>, date <date>, A2 <chr>, A3 <chr>, A4 <int>, A5 <int>,
## #   A6 <int>, A7 <int>, A8 <int>, A9 <int>, A10 <dbl>, A11 <int>, A12 <dbl>,
## #   A13 <dbl>, A14 <int>, A15 <int>, A16 <int>, birth_number <chr>,
## #   clients.district_id <int>, tkey_id <int>

We have an analysis-ready dataset available to use!

Transform data in a dm

Data transformation in dm is done by zooming in on the table you would like to work with. A zoomed dm supports dplyr operations on the zoomed table: simple transformations, grouped operations, joins, and more.

my_dm_keys %>%
  dm_zoom_to(accounts) %>%
  group_by(district_id) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  left_join(districts)

## # Zoomed table: accounts
## # A tibble:     77 x 17
##   district_id     n A2    A3        A4    A5    A6    A7    A8    A9   A10   A11
##         <int> <int> <chr> <chr>  <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1           1   554 Hl.m… Prag… 1.20e6     0     0     0     1     1   100 12541
## 2           2    42 Bene… cent… 8.89e4    80    26     6     2     5    47  8507
## 3           3    50 Bero… cent… 7.52e4    55    26     4     1     5    42  8980
## 4           4    48 Klad… cent… 1.50e5    63    29     6     2     6    67  9753
## 5           5    65 Kolin cent… 9.56e4    65    30     4     1     6    51  9307
## 6           6    48 Kutn… cent… 7.80e4    60    23     4     2     4    52  8546
## # … with 71 more rows, and 5 more variables: A12 <dbl>, A13 <dbl>, A14 <int>,
## #   A15 <int>, A16 <int>

The columns used by left_join() to consolidate the tables are inferred from the primary and foreign keys already encoded within the dm object.
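
For comparison, here is (approximately) the same join spelled out manually on the local tables, with the key columns made explicit:

my_local_dm$accounts %>%
  group_by(district_id) %>%
  summarize(n = n()) %>%
  left_join(my_local_dm$districts, by = c("district_id" = "id"))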

Reproducible dataflows with dm

Walking through dm’s data modeling fundamentals, in particular adding keys and drawing the structure, will help R users better understand data from external databases, or apply best practices from relational data modeling to their local data.

You can immediately start testing on an RStudio Cloud instance! For more examples and explanations, check out the documentation page. Install the package today!





tempdisagg: converting quarterly time series to daily

Christoph Sax


Not having a time series at the desired frequency is a common problem for researchers and analysts. For example, instead of quarterly sales, they only have annual sales. Instead of a daily stock market index, they only have a weekly index. While there is no way to fully make up for the missing data, there are useful workarounds: with the help of one or more high frequency indicator series, the low frequency series may be disaggregated into a high frequency series.



The package tempdisagg implements the standard methods for temporal disaggregation: Denton, Denton-Cholette, Chow-Lin, Fernandez, and Litterman. Our article on the temporal disaggregation of time series in the R Journal describes the package and the theory of temporal disaggregation in more detail.

The package has been around for eight years, enabling standard disaggregations such as year or quarter to quarter or month. With version 1.0, there are now some major new features: disaggregation can be performed from any frequency to any frequency, and tempdisagg now supports time series classes other than ts.

Convert between any frequency

tempdisagg can now convert between most frequencies; e.g., it can disaggregate a monthly series to daily. It is no longer restricted to regular conversions, where each low-frequency period contains the same number of high-frequency periods. Instead, a low-frequency period (e.g., a month) can contain any number of high-frequency periods (e.g., 31, 28, or 29 days). Thanks to Roger Kissling and Stella Sim, who suggested this idea.

We can convert not only months to days, but also years to days, weeks to seconds, academic years to seconds, or lunar years to hours, … The downside is that computation time depends on the number of observations, so for longer high-frequency series the computation may take a while.

In the following, we try to disaggregate quarterly GDP of Switzerland to a hypothetical daily GDP series. The example series are shipped with the package.

library(tempdisagg)
data(tempdisagg)
head(gdp.q)
##         time    value
## 1 2005-01-01 133101.3
## 2 2005-04-01 136320.4
## 3 2005-07-01 137693.7
## 4 2005-10-01 139475.9
## 5 2006-01-01 139204.7
## 6 2006-04-01 141112.5

Time series can be stored in data frames

Because we are dealing with daily data, we keep the data in a data.frame, rather than in a ts object. Other time series objects, such as xts and tsibble, are possible as well. For conversion and visualization, we use the tsbox package.

library(tsbox)
ts_plot(gdp.q, title = "Swiss GDP", subtitle = "real, not seasonally adjusted")

Series to disaggregate: quarterly gross domestic product of Switzerland


Disaggregation to daily frequency

While disaggregation can also be performed without other series (a sketch follows below), we use Swiss stock market data as an indicator series to disaggregate GDP. Data for the stock market index, the SPI, is also included in tempdisagg. Weekend and holiday values have been interpolated, because td() does not allow missing values.

ts_plot(spi.d, title = "Swiss Performance Index", subtitle = "daily values, interpolated")

Daily indicator series: Swiss Performance Index
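
As mentioned above, disaggregation also works without an indicator. A minimal sketch, using an intercept-only formula and the new 'fast' method from version 1.0 (the exact arguments here are our assumption, not part of the original walkthrough):

# No indicator: an intercept-only model produces a smooth daily series
m.d.noind <- td(gdp.q ~ 1, to = "daily", method = "fast")
gdp.d.noind <- predict(m.d.noind)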


The following uses the Chow-Lin method to disaggregate the series. A high rho parameter takes into account that the two series are unlikely to be co-integrated.

m.d.stocks <- td(gdp.q ~ spi.d, method = "chow-lin-fixed", fixed.rho = 0.9)
summary(m.d.stocks)
##
## Call:
## td(formula = gdp.q ~ spi.d, method = "chow-lin-fixed", fixed.rho = 0.9)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -10656  -1760   1076   3796   8891
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.320e+03  2.856e+01   46.22   <2e-16 ***
## spi.d       5.512e-02  3.735e-03   14.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## 'chow-lin-fixed' disaggregation with 'sum' conversion
## 59 low-freq. obs. converted to 5493 high-freq. obs.
## Adjusted R-squared: 0.7928 AR1-Parameter:   0.9

And here is the result: a daily series of GDP.

gdp.d.stocks <- predict(m.d.stocks)
ts_plot(
  ts_scale(
    ts_c(gdp.d.stocks, gdp.q)
  ),
  title = "Daily disaggregated GDP",
  subtitle = "one indicator"
)

Swiss GDP, disaggregated to daily


As with all disaggregation methods in tempdisagg, the resulting series fulfills the aggregation constraint (the resulting series is as long as the indicator, so it needs to be shortened for the comparison):

all.equal(
  ts_span(
    ts_frequency(gdp.d.stocks, "quarter", aggregate = "sum"),
    end = "2019-07-01"
  ),
  gdp.q
)
## [1] TRUE




tsbox 0.2: supporting additional time series classes

Christoph Sax


The tsbox package makes life with time series in R easier. It is built around a set of functions that convert time series between different classes. The functions are frequency-agnostic and allow the user to combine time series of multiple non-standard and irregular frequencies. A detailed overview of the package functionality is given on the documentation page (or in a previous blog post).

Version 0.2 is now on CRAN and provides a large number of bugfixes. Non-standard column names are now handled correctly, and non-standard column orders are treated consistently.

New Classes

Two more time series classes are now supported: tis time series from the tis package, and irts time series from the tseries package.

To create an object of one of these classes, it is sufficient to use the appropriate converter.

E.g., for tis time series:

library(tsbox)
ts_tis(fdeaths)
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1974  901  689  827  677  522  406  441  393  387  582  578  666
## 1975  830  752  785  664  467  438  421  412  343  440  531  771
## 1976  767 1141  896  532  447  420  376  330  357  445  546  764
## 1977  862  660  663  643  502  392  411  348  387  385  411  638
## 1978  796  853  737  546  530  446  431  362  387  430  425  679
## 1979  821  785  727  612  478  429  405  379  393  411  487  574
## class: tis

Or for irts time series:

head(ts_irts(fdeaths))
## 1974-01-01 00:00:00 GMT 901
## 1974-02-01 00:00:00 GMT 689

Conversion works from all classes to all classes, and we can easily convert these objects to any other time series class, or to a data frame:

x.tis <- ts_tis(fdeaths)
head(ts_df(x.tis))
##         time value
## 1 1974-01-01   901
## 2 1974-02-01   689
## 3 1974-03-01   827
## 4 1974-04-01   677
## 5 1974-05-01   522
## 6 1974-06-01   406

Class-agnostic functions

Because coercion works reliably and is well tested, we can use it to make functions class-agnostic. If a class-agnostic function works for one class, it works for all:

ts_pc(ts_tis(fdeaths))
ts_pc(ts_irts(fdeaths))
ts_pc(ts_df(fdeaths))
ts_pc(fdeaths)

ts_pc() calculates percentage change rates compared to the previous period. It works like a 'generic' function: you can apply it to any time series object, and it will return an object of the same class as its input.
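
A quick way to see this round trip (a small illustration, not from the original post):

class(ts_pc(fdeaths))         # "ts"
class(ts_pc(ts_df(fdeaths)))  # "data.frame"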

So whether we want to smooth, scale, differentiate, chain-link, forecast, regularize, or seasonally adjust a series, we can apply the same commands to all time series classes. tsbox offers a comprehensive toolkit for the basics of time series manipulation. Here are some additional examples:

ts_pcy(fdeaths)                # p.c., compared to same period of prev. year
ts_forecast(fdeaths)           # forecast, by exponential smoothing
ts_seas(fdeaths)               # seasonal adjustment, by X-13
ts_frequency(fdeaths, "year")  # convert to annual frequency
ts_span(fdeaths, "-1 year")    # limit time span to final year

There are many more. Because they all start with ts_, you can use auto-complete to see what’s around. Most conveniently, there is a time series plot function that works for all classes and frequencies:

ts_plot(
  `Airline Passengers` = AirPassengers,
  `Lynx trappings` = ts_tis(lynx),
  `Deaths from Lung Diseases` = ts_xts(fdeaths),
  title = "Airlines, trappings, and deaths",
  subtitle = "Monthly passengers, annual trappings, monthly deaths"
)

time series plot








