We are on the lookout for a DevOps System Engineer. You are familiar with Linux, Docker, Git, CI/CD and Ansible. We offer interesting projects around the R ecosystem and a lot of freedom. Some on-site time will be required, but a lot of remote work is possible as well. This position is especially attractive if you are interested in data science and want to learn more about R.
Responsibility for enhancing and maintaining our hosted RStudio service products and our internal infrastructure
Communicate with IT departments of our clients
Set up and maintain infrastructure as code
Very good knowledge of Linux, Docker, Git, CI/CD and Ansible.
Ability and desire to learn and improve on the job
Very good command of written and spoken German
Good working knowledge of written and spoken English
An interest in R and data science in general is a plus.
An open source friendly environment that encouranges community contribution
Interesting projects around consulting and open source software development
Offices in Zurich: at Stauffacher and near ETH Hönggerberg
Flexible working hours with the possibility to work from home
Please submit your application via firstname.lastname@example.org. Get in touch with us if you have further questions.
cynkra is a Zurich-based data consulting company with a strong focus on R. We use R and the tidyverse in the vast majority of our projects. We are an RStudio Full Certified Partner.
We support businessess and organizations by helping them picking the right tools, implementing solutions, training and code review. We are enthusiastic about open source software and contribute to a large number of R packages. Learn more at www.cynkra.com.
When committing to a Git repository related to my consulting work, it’s very important for me to use my company e-mail address, email@example.com . Not so much for my open-source work – for this I prefer other addresses, like firstname.lastname@example.org . (For example, Travis CI sends notification e-mails to the committer’s e-mail address, and I have set up filtering for that other address.)
Configuring the e-mail address for each repository separately gets annoying very soon. Rather, I’d like all repos in a specific subdirectory to use a specific e-mail address.
All my Git repos live in
cynkra, respectively, contain R packages and repos related to consulting. To achieve the desired setup, I edit my
~/.gitconfig to contain the following entry:
[includeIf "gitdir:git/**"] path = git/.gitconfig
This makes sure that all repos that live underneath the
git directory use the
git/.gitconfig file in addition to the main configuration. That file contains the following:
[includeIf "gitdir:R/**"] path = R/.gitconfig [includeIf "gitdir:cynkra/**"] path = cynkra/.gitconfig
~/git/cynkra/.gitconfig, I configure the e-mail addresses I want to use for all repos underneath
[user] email = ...
I verify the setup with
git config -l | grep user. Indeed, cynkra repos use the cynkra e-mail address. Voilà!
This requires a recent-ish version of
git, version 2.14 or later should suffice. Read more about conditional includes.
Angel D'az, Kirill Müller
Relational databases are a very powerful tool for analyzing and manipulating data. However, many modeling workflows require a great deal of time and effort for wrangling data from databases into a flat data frame or table format. Only then the actual data analysis can start.
With the proper tools, analysis can begin using a relational model that works directly with the database and, if wrangling is still required, R users can leverage the powerful and proven SQL approach to data organization and manipulation.
The dm package takes the primary advantage of databases – relational modeling – and brings it to R. In relational databases, tables or data frames are linked through primary keys (PK) and foreign keys (FK). Instead of having a single wide table to work with, data is segmented across multiple tables for the purpose of eliminating, or at least reducing, redundancies. This process is called normalization. A classic example is storing unique identifiers in a large table and look up values for these unique identifiers in a smaller data frame. This lets an analysis run without reading or writing look up values until necessary, because unique identifiers are enough for most of the runtime.
dm 0.1.1 is available on CRAN. You can now download and install dm, from CRAN, with the following command:
We connect to a relational dataset repository with a database server that is publicly accessible without registration.
There is a financial dataset that contains loan data, along with relevant information and transactions.
We chose this loan dataset because the relationships between
transcactions tables is a good representation of databases that record real-world business transactions.
The dataset page lists the credentials required for connecting to the database:
These can be used in e.g. MySQL Workbench to download the CSV data manually. To automate and keep the data on the database for as long as possible, we connect to the database from R through its database interface to access the tables:
library(RMariaDB) my_db <- dbConnect( MariaDB(), user = 'guest', password = 'relational', dbname = 'Financial_ijs', host = 'relational.fit.cvut.cz' ) dbListTables(my_db) ##  "accounts" "cards" "clients" "disps" "districts" "loans" ##  "orders" "tkeys" "trans"
By creating a dm object from the connection, we get access to all tables:
library(dm) my_dm <- dm_from_src(my_db) my_dm ## ── Table source ──────────────────────────────────────────────────────────────── ## src: mysql [email@example.com:NA/Financial_ijs] ## ── Metadata ──────────────────────────────────────────────────────────────────── ## Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total) ## Columns: 57 ## Primary keys: 0 ## Foreign keys: 0
names(my_dm) ##  "accounts" "cards" "clients" "disps" "districts" "loans" ##  "orders" "tkeys" "trans"
my_dm$accounts ## # Source: table<accounts> [?? x 4] ## # Database: mysql [firstname.lastname@example.org:NA/Financial_ijs] ## id district_id frequency date ## <int> <int> <chr> <date> ## 1 1 18 POPLATEK MESICNE 1995-03-24 ## 2 2 1 POPLATEK MESICNE 1993-02-26 ## 3 3 5 POPLATEK MESICNE 1997-07-07 ## 4 4 12 POPLATEK MESICNE 1996-02-21 ## 5 5 15 POPLATEK MESICNE 1997-05-30 ## 6 6 51 POPLATEK MESICNE 1994-09-27 ## # … with more rows
The components of this particular dm object are lazy tables powered by dbplyr. This package translates the dplyr grammar of data manipulation into queries the database server understands. The advantage to a lazy table is that there is no data download until results are collected for printing or local processing. Below, the summary operation is computed on the database and only the results are sent back to the R session.
library(dplyr) my_dm$accounts %>% group_by(district_id) %>% summarize(n = n()) %>% ungroup() ## # Source: lazy query [?? x 2] ## # Database: mysql [email@example.com:NA/Financial_ijs] ## district_id n ## <int> <int64> ## 1 1 554 ## 2 2 42 ## 3 3 50 ## 4 4 48 ## 5 5 65 ## 6 6 48 ## # … with more rows
If the data fits into your RAM, a database connection is not required to use dm.
collect() verb downloads all tables for our dm object.
my_local_dm <- my_dm %>% collect() object.size(my_local_dm) ## 77922024 bytes
my_local_dm$accounts %>% group_by(district_id) %>% summarize(n = n()) %>% ungroup() ## # A tibble: 77 x 2 ## district_id n ## <int> <int> ## 1 1 554 ## 2 2 42 ## 3 3 50 ## 4 4 48 ## 5 5 65 ## 6 6 48 ## # … with 71 more rows
A dm object can also be created from individual data frames with the
Primary and foreign keys are how relational database tables are linked with each other. The model diagram provided by our test database illustrates the intended relationships.
However, it turns out this is not an accurate representation of the entities and relationships within the database:
tkeystable available in the database that is not listed in the model diagram.
Financial_stddatabase is similar, but different to the one that we work with,
Bearing these discrepancies in mind, we can define suitable primary and foreign keys for our dm object.
The documentation suggests that the
loans table is the most important one.
We color the target table separately with
# Defining PKs and FKs my_dm_keys <- my_local_dm %>% dm_add_pk(districts, id) %>% dm_add_pk(accounts, id) %>% dm_add_pk(clients, id) %>% dm_add_pk(loans, id) %>% dm_add_pk(orders, id) %>% dm_add_pk(trans, id) %>% dm_add_pk(disps, id) %>% dm_add_pk(cards, id) %>% dm_add_fk(loans, account_id, accounts) %>% dm_add_fk(orders, account_id, accounts) %>% dm_add_fk(trans, account_id, accounts) %>% dm_add_fk(disps, account_id, accounts) %>% dm_add_fk(disps, client_id, clients) %>% dm_add_fk(accounts, district_id, districts) %>% dm_add_fk(cards, disp_id, disps) %>% dm_set_colors(green = loans) # Draw the visual model my_dm_keys %>% dm_draw()
The discrepancies highlight the importance of being able to define primary and foreign keys. Most struggles in manipulating data are not syntax knowledge gaps. Syntax can always be looked up with search engines. Knowledge gaps about how data is organized is a much more common stumbling block to R users when working with distributed data.
Insight into the structure of a database using the built-in
dm_draw() function provides an an instant efficiency boost.
Combined with defining unique identifiers (primary keys) and how they are found by other tables (foreign keys), an R user can quickly clarify the structures that they are working with.
To assist with this process of defining structure, dm comes with a built-in helper to check referential integrity of the dataset:
my_dm_keys %>% dm_examine_constraints() ## ℹ All constraints satisfied.
For modeling, a flat table or matrix is required as input. If normalization is the process of splitting up a table to reduce redundancies, joining multiple tables together is called denormalizing.
dm_squash_to_tbl() function creates a denormalized table by performing a cascading join between
cards and all outgoing foreign keys.
join is the SQL term for combining some or all of the unique columns between 2 or more tables into a single table using the appropriate keys.
In this case, the
cards table has a foreign key to
disps table, which has a foreign key to
accounts, which also has a foreign key to the
These foreign key relationships are then used in a cascading join within the
dm_squash_to_tbl() function, without having to specify the relationships because they are already encoded within the
my_dm_keys %>% dm_squash_to_tbl(cards) ## Renamed columns: ## * type -> cards$cards.type, disps$disps.type ## * district_id -> accounts$accounts.district_id, clients$clients.district_id ## # A tibble: 892 x 28 ## id disp_id cards.type issued client_id account_id disps.type ## <int> <int> <chr> <date> <int> <int> <chr> ## 1 1 9 gold 1998-10-16 9 7 OWNER ## 2 2 19 classic 1998-03-13 19 14 OWNER ## 3 3 41 gold 1995-09-03 41 33 OWNER ## 4 4 42 classic 1998-11-26 42 34 OWNER ## 5 5 51 junior 1995-04-24 51 43 OWNER ## 6 7 56 classic 1998-06-11 56 48 OWNER ## # … with 886 more rows, and 21 more variables: accounts.district_id <int>, ## # frequency <chr>, date <date>, A2 <chr>, A3 <chr>, A4 <int>, A5 <int>, ## # A6 <int>, A7 <int>, A8 <int>, A9 <int>, A10 <dbl>, A11 <int>, A12 <dbl>, ## # A13 <dbl>, A14 <int>, A15 <int>, A16 <int>, birth_number <chr>, ## # clients.district_id <int>, tkey_id <int>
We have an analysis-ready dataset available to use!
Data transformation in dm is done by zooming on the table you would like to work with. A zoomed dm supports dplyr operations on the zoomed table: simple transformations, grouped operations, joins, and more.
my_dm_keys %>% dm_zoom_to(accounts) %>% group_by(district_id) %>% summarize(n = n()) %>% ungroup() %>% left_join(districts) ## # Zoomed table: accounts ## # A tibble: 77 x 17 ## district_id n A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 ## <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <dbl> <int> ## 1 1 554 Hl.m… Prag… 1.20e6 0 0 0 1 1 100 12541 ## 2 2 42 Bene… cent… 8.89e4 80 26 6 2 5 47 8507 ## 3 3 50 Bero… cent… 7.52e4 55 26 4 1 5 42 8980 ## 4 4 48 Klad… cent… 1.50e5 63 29 6 2 6 67 9753 ## 5 5 65 Kolin cent… 9.56e4 65 30 4 1 6 51 9307 ## 6 6 48 Kutn… cent… 7.80e4 60 23 4 2 4 52 8546 ## # … with 71 more rows, and 5 more variables: A12 <dbl>, A13 <dbl>, A14 <int>, ## # A15 <int>, A16 <int>
The columns used by
left_join() to consolidate tables are inferred from the primary and foreign keys already encoded within the
Walking through dm’s data modeling fundamentals, in particular adding keys and drawing the structure, will help R users better understand data from external databases, or apply best practices from relational data modeling to their local data.
cynkra teamMaintaining multiple identities with Git
Kirill MüllerRelational data models in R
Angel D'az, Kirill Müllertempdisagg: converting quarterly time series to daily
Christoph Saxtsbox 0.2: supporting additional time series classes
Christoph SaxDevOps System Engineer (40-60%)
cynkra teamIntroducing dm: easy juggling of tables and relations
Balthasar Sagertsbox 0.1: class-agnostic time series
Christoph SaxData Scientist/Engineer (40-100%)
cynkra teamTime series of the world, unite!
Christoph SaxDone “Establishing DBI”!?