Using 1password, GPG and git for seamless commits signing (Baptiste Maingret, 2022-02-15)<p>Setting up gpg, git and 1Password to have your git commits signed, while storing your GPG key passphrase in 1Password and unlocking it directly from your terminal.</p>
<!--more-->
<ul>
<li><a href="#requirements-and-setup">Requirements and setup</a></li>
<li><a href="#creating-a-gpg-key">Creating a GPG key</a></li>
<li><a href="#setting-up-git">Setting up git</a></li>
<li><a href="#setting-up-github">Setting up GitHub</a></li>
<li><a href="#setting-up-1password-cli">Setting up 1Password CLI</a></li>
<li><a href="#configuring-gpg-agent">Configuring gpg-agent</a></li>
<li><a href="#putting-it-all-together">Putting it all together</a>
<ul>
<li><a href="#finding-your-1password-entry">Finding your 1Password entry</a></li>
<li><a href="#getting-your-gpg-key-grip">Getting your GPG key grip</a></li>
<li><a href="#binding-the-two">Binding the two</a></li>
<li><a href="#running-at-login">Running at login</a></li>
</ul>
</li>
<li><a href="#testing-it">Testing it</a></li>
<li><a href="#sources">Sources</a></li>
</ul>
<h2 id="requirements-and-setup">Requirements and setup</h2>
<p>This was done on WSL2.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ uname -a
</span><span class="gp">Linux DESKTOP-AGPN69M 5.10.60.1-microsoft-standard-WSL2 #</span>1 SMP Wed Aug 25 23:20:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
<span class="go">❯ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
</span></code></pre></div></div>
<p>Make sure you have git and gpg installed, and an active 1Password account.</p>
<p>You’ll often find advice to use gpg2 when it is available on your system, but in my case both gpg and gpg2 pointed to the same binary.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ ls -l "$</span><span class="o">(</span>which gpg<span class="o">)</span><span class="s2">"
</span><span class="go">.rwxr-xr-x 1.1M root 6 Jan 2021 /usr/bin/gpg
</span><span class="gp">❯ ls -l "$</span><span class="o">(</span>which gpg2<span class="o">)</span><span class="s2">"
</span><span class="gp">lrwxrwxrwx 3 root 6 Jan 2021 /usr/bin/gpg2 -></span><span class="w"> </span><span class="s2">gpg
</span><span class="go">❯ gpg --version
gpg (GnuPG) 2.2.19
</span></code></pre></div></div>
<p>I will use <a href="https://stedolan.github.io/jq/">jq</a> but this is not essential.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ jq --version
jq-1.6
</span></code></pre></div></div>
<h2 id="creating-a-gpg-key">Creating a GPG key</h2>
<p>This is a summary of <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key">GitHub - Generating a new GPG key</a>, that I followed.</p>
<p>Run the following command and follow instructions.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --full-generate-key
</span></code></pre></div></div>
<p>Then list your fresh key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --list-secret-keys --keyid-format=long
/home/baptiste/.gnupg/pubring.kbx
---------------------------------
sec rsa4096/0052A8D354A5C655 2022-02-09 [SC]
9BA03414AB56590B6DB5369F0052A8D354A5C655
</span><span class="gp">uid [ultimate] Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">ssb rsa4096/A5B8C64E8929B475 2022-02-09 [E]
</span></code></pre></div></div>
<p>Look at the <code class="language-plaintext highlighter-rouge">sec</code> line and note the GPG key ID: <code class="language-plaintext highlighter-rouge">0052A8D354A5C655</code>.</p>
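<p>If you want to grab this ID in a script rather than by eye, a small helper can pull it out of the listing. This is only a sketch: the <code class="language-plaintext highlighter-rouge">key_id_from_listing</code> name is my own, and it assumes the <code class="language-plaintext highlighter-rouge">--keyid-format=long</code> output shown above.</p>

```shell
# Pull the long key ID out of gpg's listing: it is the part after
# the slash on the "sec" line. (Helper name is mine, not a gpg command.)
key_id_from_listing() {
  awk '/^sec/ { split($2, parts, "/"); print parts[2]; exit }'
}

# In practice: gpg --list-secret-keys --keyid-format=long | key_id_from_listing
```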
<p>Then we export the corresponding public key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --armor --export 0052A8D354A5C655
-----BEGIN PGP PUBLIC KEY BLOCK-----
</span><span class="gp">#</span><span class="w"> </span>your public key
<span class="go">-----END PGP PUBLIC KEY BLOCK-----
</span></code></pre></div></div>
<p>Copy everything including the starting and ending blocks.</p>
<h2 id="setting-up-git">Setting up git</h2>
<p>First let’s tell <code class="language-plaintext highlighter-rouge">git</code> which key to use. Using your GPG key ID, run:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git config --global user.signingkey 0052A8D354A5C655
</span></code></pre></div></div>
<p><strong>N.B.</strong> This will configure it globally; you may need to configure it per repository depending on your usage.</p>
<p>Then we will tell <code class="language-plaintext highlighter-rouge">git</code> to sign <strong>every commit of every repository</strong>.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git config --global commit.gpgsign true
</span></code></pre></div></div>
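<p>You can double-check what git will actually use before your next commit. Note that if <code class="language-plaintext highlighter-rouge">user.signingkey</code> is unset, gpg falls back to looking up a key matching your committer identity.</p>

```shell
# Show the effective signing configuration, with a notice when a value is unset.
git config --global --get user.signingkey || echo "user.signingkey not set"
git config --global --get commit.gpgsign  || echo "commit.gpgsign not set"
```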
<h2 id="setting-up-github">Setting up GitHub</h2>
<p>Instructions may change. Check online documentation <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/adding-a-new-gpg-key-to-your-github-account">Adding a new GPG key to your GitHub account</a>.</p>
<p>TL;DR. <code class="language-plaintext highlighter-rouge">Settings > Access > New GPG key</code></p>
<h2 id="setting-up-1password-cli">Setting up 1Password CLI</h2>
<p>Instructions may change. Check online documentation. <a href="https://support.1password.com/command-line-getting-started/">1Password CLI: Getting started</a></p>
<p><strong>N.B.</strong> The version is hardcoded in the URL, so check the official website for the latest one.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ curl -S https://cache.agilebits.com/dist/1P/op/pkg/v1.12.4/op_linux_amd64_v1.12.4.zip ></span><span class="w"> </span>op.zip
<span class="go"> % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3810k 100 3810k 0 0 4897k 0 --:--:-- --:--:-- --:--:-- 4891k
❯ unzip op.zip -d op
Archive: op.zip
extracting: op/op.sig
inflating: op/op
❯ gpg --keyserver hkps://keyserver.ubuntu.com --receive-keys 3FEF9748469ADBE15DA7CA80AC2D62742012EA22
</span><span class="gp">gpg: key AC2D62742012EA22: public key "Code signing for 1Password <codesign@1password.com></span><span class="s2">" imported
</span><span class="go">gpg: Total number processed: 1
gpg: imported: 1
❯ gpg --verify op/op.sig op/op
gpg: Signature made Fri Jan 14 22:38:08 2022 CET
gpg: using RSA key 3FEF9748469ADBE15DA7CA80AC2D62742012EA22
</span><span class="gp">gpg: Good signature from "Code signing for 1Password <codesign@1password.com></span><span class="s2">" [unknown]
</span><span class="go">gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 3FEF 9748 469A DBE1 5DA7 CA80 AC2D 6274 2012 EA22
❯ sudo mv op /usr/bin
❯ op --version
1.12.4
</span></code></pre></div></div>
<p>Try to sign in.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">op signin my.1password.com your.email@example.com
</span></code></pre></div></div>
<h2 id="configuring-gpg-agent">Configuring gpg-agent</h2>
<p>We will make use of <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code> to cache our passphrase for our key. For that we need to make sure <code class="language-plaintext highlighter-rouge">gpg-agent</code> allows it.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ echo "allow-preset-passphrase" ></span><span class="o">></span> ~/.gnupg/gpg-agent.conf
</code></pre></div></div>
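<p>Note that re-running the command above appends a duplicate line each time. If your setup script may run more than once, a small guard keeps the file idempotent (the <code class="language-plaintext highlighter-rouge">ensure_line</code> helper is my own naming, not a gpg tool). After editing the file, reload the agent with <code class="language-plaintext highlighter-rouge">gpg-connect-agent reloadagent /bye</code> so the option takes effect.</p>

```shell
# Append a line to a config file only if it is not already present,
# so repeated setup runs do not pile up duplicates.
ensure_line() {
  line=$1 file=$2
  mkdir -p "$(dirname "$file")"
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# In practice: ensure_line "allow-preset-passphrase" ~/.gnupg/gpg-agent.conf
```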
<h2 id="putting-it-all-together">Putting it all together</h2>
<h3 id="finding-your-1password-entry">Finding your 1Password entry</h3>
<p>I will assume you have a <code class="language-plaintext highlighter-rouge">1Password</code> entry storing your GPG key passphrase, with the name <code class="language-plaintext highlighter-rouge">GPG passphrase</code>.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ op get item "GPG passphrase" | jq ".uuid"
"vmgevmdnbbuui3evhksdftjhju"
</span></code></pre></div></div>
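<p>The UUID is printed with its surrounding quotes. When capturing it in a shell variable, <code class="language-plaintext highlighter-rouge">jq -r</code> gives you the raw string instead. A quick sketch, with sample JSON standing in for the real <code class="language-plaintext highlighter-rouge">op get item</code> output:</p>

```shell
# -r makes jq emit the raw string without quotes, handy for variables.
sample='{"uuid":"vmgevmdnbbuui3evhksdftjhju"}'
uuid=$(printf '%s' "$sample" | jq -r '.uuid')
echo "$uuid"
```

<p>In practice you would pipe <code class="language-plaintext highlighter-rouge">op get item "GPG passphrase"</code> into the same <code class="language-plaintext highlighter-rouge">jq -r '.uuid'</code> call.</p>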
<h3 id="getting-your-gpg-key-grip">Getting your GPG key grip</h3>
<p>We list our keys and their key grips.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --list-secret-keys --with-keygrip
/home/baptiste/.gnupg/pubring.kbx
---------------------------------
sec rsa4096 2022-02-09 [SC]
9BA03414AB56590B6DB5369F0052A8D354A5C655
Keygrip = 80160C5055DA07978E939C0575A4E8DA0B1ECF27
</span><span class="gp">uid [ultimate] Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">ssb rsa4096 2022-02-09 [E]
Keygrip = C04ACB8C33AAA68943194D7D1A56954BF76B5C2C
</span></code></pre></div></div>
<p>Look at the <code class="language-plaintext highlighter-rouge">sec</code> block and at the <code class="language-plaintext highlighter-rouge">Keygrip</code> entry: <code class="language-plaintext highlighter-rouge">80160C5055DA07978E939C0575A4E8DA0B1ECF27</code>.</p>
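<p>As with the key ID, you can extract the primary key’s grip in a script. A sketch (the <code class="language-plaintext highlighter-rouge">sec_keygrip</code> helper is my own naming), assuming the <code class="language-plaintext highlighter-rouge">--with-keygrip</code> layout shown above:</p>

```shell
# Print the keygrip of the primary key: the first "Keygrip =" line
# that follows the "sec" line.
sec_keygrip() {
  awk '/^sec/ { insec = 1 }
       insec && /Keygrip/ { print $3; exit }'
}

# In practice: gpg --list-secret-keys --with-keygrip | sec_keygrip
```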
<h3 id="binding-the-two">Binding the two</h3>
<p>We ask 1Password to retrieve the passphrase and pipe it directly to <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code>, specifying our key grip. Note that <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code> reads the passphrase from <code class="language-plaintext highlighter-rouge">stdin</code> by default.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">op get item vmgevmdnbbuui3evhksdftjhju --fields password | /usr/lib/gnupg2/gpg-preset-passphrase --preset 80160C5055DA07978E939C0575A4E8DA0B1ECF27
</span></code></pre></div></div>
<p>If you weren’t already logged in to 1Password, you will be asked to enter your master password.</p>
<h3 id="running-at-login">Running at login</h3>
<p>I am using <code class="language-plaintext highlighter-rouge">zsh</code> as a shell, so I will add the following to my <code class="language-plaintext highlighter-rouge">~/.zshrc</code>, but you should be able to do the same in <code class="language-plaintext highlighter-rouge">~/.bashrc</code> for instance. Note that if you are using <a href="https://github.com/romkatv/powerlevel10k">powerlevel10k</a>, you will need to put it before the <code class="language-plaintext highlighter-rouge">instant-prompt</code> configuration.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>gpg_cache <span class="o">()</span> <span class="o">{</span>
gpg-connect-agent /bye &> /dev/null <span class="c"># 1</span>
<span class="nb">eval</span> <span class="si">$(</span>op signin my<span class="si">)</span> <span class="c"># 2</span>
op get item vmgevmdnbbuui3evhksdftjhju <span class="nt">--fields</span> password | /usr/lib/gnupg2/gpg-preset-passphrase <span class="nt">--preset</span> 80160C5055DA07978E939C0575A4E8DA0B1ECF27 <span class="c"># 3</span>
<span class="o">}</span>
gpg_cache <span class="c"># 4</span>
</code></pre></div></div>
<ol>
<li><code class="language-plaintext highlighter-rouge">gpg-agent</code> is started automatically when required. However, since <code class="language-plaintext highlighter-rouge">gpg</code> itself is not invoked here while we still need <code class="language-plaintext highlighter-rouge">gpg-agent</code> running, we have to make sure it is started. This is the best way I found to achieve it.</li>
<li>Log in to <code class="language-plaintext highlighter-rouge">1Password</code>.</li>
<li>Use our one-liner to retrieve the passphrase and cache it.</li>
<li>Call our beautiful function.</li>
</ol>
<p><strong>N.B.</strong> This will require you to log in each time you start a session. You could also simply remove the call to <code class="language-plaintext highlighter-rouge">gpg_cache</code> and run it from your terminal when needed.</p>
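<p>If the repeated logins bother you, one option is to sign in only when the cached session is no longer valid. This is a sketch assuming 1Password CLI v1, where <code class="language-plaintext highlighter-rouge">op list vaults</code> exits non-zero once the session token has expired, and the <code class="language-plaintext highlighter-rouge">my</code> account shorthand used above:</p>

```shell
# Prompt for the 1Password master password only when the cached
# session is invalid. (ensure_op_session is my own helper name.)
ensure_op_session() {
  if ! op list vaults > /dev/null 2>&1; then
    eval "$(op signin my)"
  fi
}
```

<p>You could then call <code class="language-plaintext highlighter-rouge">ensure_op_session</code> in <code class="language-plaintext highlighter-rouge">gpg_cache</code> in place of the unconditional <code class="language-plaintext highlighter-rouge">eval $(op signin my)</code>.</p>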
<h2 id="testing-it">Testing it</h2>
<p>Go into one of your git repositories; let’s create a branch and try this out.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git checkout -b signing
Switched to a new branch 'signing'
❯ touch dirty
❯ git add dirty
❯ git commit -m "Trying signing"
[signing 1426360] Trying signing
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 chapter-2/dirty
❯ git log --show-signature -1
</span><span class="gp">commit 1426360d301b88036feef02e00044e6ca62a9fd3 (HEAD -></span><span class="w"> </span>signing<span class="o">)</span>
<span class="go">gpg: Signature made Tue Feb 15 21:46:51 2022 CET
gpg: using RSA key 9BA03414AB56590B6DB5369F0052A8D354A5C655
</span><span class="gp">gpg: Good signature from "Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="s2">" [ultimate]
</span><span class="gp">Author: Baptiste Maingret <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">Date: Tue Feb 15 21:46:51 2022 +0100
Trying signing
</span></code></pre></div></div>
<p>Remove our dirty work.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git reset HEAD^
❯ rm dirty
❯ git checkout main
Switched to branch 'main'
❯ git branch -D signing
</span></code></pre></div></div>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://support.1password.com/command-line-getting-started/">1Password CLI - Getting Started</a></li>
<li><a href="https://docs.github.com/en/authentication/managing-commit-signature-verification">GitHub Docs - Commit signature</a></li>
<li><a href="https://stackoverflow.com/questions/38384957/prevent-git-from-asking-for-the-gnupg-password-during-signing-a-commit">Stackoverflow - Prevent git from asking for the GnuPG password during signing a commit</a></li>
</ul>

Reading session #4 (2021-12-20)<h2 id="articles">Articles</h2>
<ul>
<li><a href="#how-ai-can-improve-products-for-people-with-impaired-speech">How AI can improve products for people with impaired speech</a></li>
<li><a href="#unlocking-human-rights-information-with-machine-learning">Unlocking human rights information with machine learning</a></li>
<li><a href="#a-9b-ai-failure-examined">A $9B AI Failure, Examined</a></li>
<li><a href="#machine-learning-model-development-and-model-operations-principles-and-practices">Machine Learning Model Development and Model Operations: Principles and Practices</a></li>
<li><a href="#avoid-these-mistakes-with-time-series-forecasting">Avoid These Mistakes with Time Series Forecasting</a></li>
</ul>
<h2 id="how-ai-can-improve-products-for-people-with-impaired-speech">How AI can improve products for people with impaired speech</h2>
<p>Source:</p>
<ul>
<li><a href="https://blog.google/outreach-initiatives/accessibility/impaired-speech-recognition/">blog.google.com</a></li>
<li><a href="https://www.als.net/news/als-tdi-and-google-collaborate-to-bring-ai-to-als/">als.net</a></li>
</ul>
<blockquote>
<p>Some have recorded hundreds or thousands of specific phrases in order to train and optimize Google’s AI-based algorithms</p>
</blockquote>
<p>Voice recognition models are built from thousands of speech recordings, but none of them are from people with speech impairments. ALS TDI joined forces with Google and recruited people with ALS to record thousands of voice samples. By training their voice recognition models on those recordings, they managed to improve the models’ recognition of impaired speech. They do not provide any specific number on the accuracy improvement.</p>
<p>Once again, data is the key in those deep learning models, but it is also nice to see it working!</p>
<h2 id="unlocking-human-rights-information-with-machine-learning">Unlocking human rights information with machine learning</h2>
<p>Source: <a href="https://www.blog.google/outreach-initiatives/google-org/unlocking-human-rights-information-with-machine-learning/">blog.google.com</a></p>
<blockquote>
<p>they’ve built new tools that can automatically tag human rights documents so they are searchable — making the curation process 13 times faster</p>
</blockquote>
<p>Surveying the evolution of human rights across the globe is a challenging and time-consuming task! HURIDOCS built several models to help process corpora of human rights information, extracting and classifying relevant data.</p>
<p>In addition to winning the Peace and Justice Strong Institutions Award at the 2021 edition of CogX, they also offer a key tool as open source: <a href="https://huridocs.org/technology/uwazi/">Uwazi</a> that allows human rights defenders to store, organize and search through collections of human rights information.</p>
<h2 id="a-9b-ai-failure-examined">A $9B AI Failure, Examined</h2>
<p>Source: <a href="https://www.linkedin.com/pulse/9b-ai-fail-gianluca-mauro/">linkedin.com</a></p>
<p>This has already made the news multiple times: Zillow, an online real estate marketplace, lost hundreds of millions of dollars in addition to their stock going down, all because of a poor usage of an ML model.</p>
<p>Basically, one of their businesses was to estimate house prices and buy houses from their owners in the hope of making a profit on the future sale. For that they used an ML algorithm, which presumably came from the Kaggle competition they conducted. Sadly, the real estate market is not so simple. The article points out a few key issues:</p>
<ul>
<li>The real estate market is not stable and is extremely subject to external effects (cf. Covid-19)</li>
<li>Zillow wanted to renovate homes before selling them. However, what happens if instead of selling a home in 2 months, you have to wait 6, 12 or even more because of delays…</li>
<li>You have to think upfront about who is willing to sell their home fast, directly to Zillow. No visit of the property before buying?!</li>
<li>This one is very interesting: if your model is right 95% of the time, then 2.5% of the time it will be under the right price, and it is safe to assume most people won’t sell you those houses. You end up buying houses either at the right price or too high.</li>
<li>Sometimes the smallest change makes all the difference, and AI models can’t encode them all. For example, a difference of a few street numbers can drastically change the price.</li>
<li>The last one is golden: a senior DS job offer focusing on Facebook Prophet library skills. This might show that the ML management behind Zillow’s algorithm may not have had its focus on the right things.</li>
</ul>
<h2 id="machine-learning-model-development-and-model-operations-principles-and-practices">Machine Learning Model Development and Model Operations: Principles and Practices</h2>
<p>Source: <a href="https://www.kdnuggets.com/2021/10/machine-learning-model-development-operations-principles-practice.html">kdnuggets.com</a></p>
<p>This article summarizes all the steps of deploying an ML model in production, including key parts such as model performance monitoring and model version management. I think this article reveals one key issue as of today: the number of steps and tools can be overwhelming. You can go the simple way, with custom Python code and some DevOps principles, but that either will not scale very well or will require a lot of effort and resources to grow, and as the codebase grows it becomes harder for several people to work on the same project. Cloud platform solutions can then help, making complex tools easily available, but at the cost of being more tightly coupled to their platform.</p>
<h2 id="avoid-these-mistakes-with-time-series-forecasting">Avoid These Mistakes with Time Series Forecasting</h2>
<p>Source: <a href="https://www.kdnuggets.com/2021/12/avoid-mistakes-time-series-forecasting.html">kdnuggets.com</a></p>
<p>Sometimes we want to quickly check that our data is anything but random, so we generate random samples and compare simple metrics to see if there is any significant difference. However, if we don’t pick the right distribution or random generator, it is easy to get fooled.</p>
<p>In this example, they make the point that market stock price time series shouldn’t be compared to random samples drawn from a normal distribution, but to random walk generated numbers.</p>
<p>In the same way, it is easy to stop as soon as we identify any relevant difference. Such is the case when they compare the differenced time series, which seem to be different, but once you compare their autocorrelations, this is not the case anymore.</p>
<p>It is easy to fall into the “you only find what you are looking for” pit, and one must be careful and prefer a consistent approach to data analysis.</p>

Python and Poetry on Docker (2021-11-15)<p>Build a multi-stage Docker image from official Python images with support for Poetry projects.</p>
<p><a href="https://github.com/bmaingret/coach-planner">Source code on Github</a></p>
<!--more-->
<p>Updated following <a href="https://github.com/bmaingret/coach-planner/issues">issues</a> on the GitHub repository.</p>
<p>Sources:</p>
<ul>
<li><a href="https://pythonspeed.com/articles/base-image-python-docker-images/">Python=>Speed</a></li>
<li><a href="https://hub.docker.com/_/python">Python images on docker.com</a></li>
<li><a href="https://github.com/michaeloliverx/python-poetry-docker-example/blob/master/docker/Dockerfile">Dockerfile on github.com/michaeloliverx</a></li>
<li><a href="https://www.mktr.ai/the-data-scientists-quick-guide-to-dockerfiles-with-examples">Dockerfiles on mktr.ai</a></li>
<li><a href="https://github.com/python-poetry/poetry/discussions/1879">Discussions on Github Poetry</a></li>
</ul>
<h2 id="summary">Summary</h2>
<ul>
<li><a href="#multi-stage-build">Multi-stage build</a></li>
<li><a href="#choosing-a-base-version">Choosing a base version</a></li>
<li><a href="#stage-staging">Stage: Staging</a>
<ul>
<li><a href="#arg-and-environment-variables">ARG and environment variables</a></li>
<li><a href="#install-poetry">Install Poetry</a></li>
<li><a href="#source-file-and-dependencies">Source file and dependencies</a></li>
</ul>
</li>
<li><a href="#stage-development">Stage: Development</a>
<ul>
<li><a href="#install-our-project">Install our project</a></li>
<li><a href="#flask-webserver-and-entrypoint">Flask webserver and entrypoint</a></li>
</ul>
</li>
<li><a href="#stage-build">Stage: Build</a></li>
<li><a href="#stage-production">Stage: Production</a>
<ul>
<li><a href="#environment-variables">Environment variables</a></li>
<li><a href="#installating-our-application">Installating our application</a></li>
<li><a href="#entrypoint">Entrypoint</a></li>
</ul>
</li>
<li><a href="#build-our-image-and-use-it">Build our image and use it!</a>
<ul>
<li><a href="#production-image">Production image</a></li>
<li><a href="#development-image">Development image</a></li>
</ul>
</li>
</ul>
<h2 id="multi-stage-build">Multi-stage build</h2>
<p>A multi-stage build allows you to:</p>
<ul>
<li>stop at a specific step of a build</li>
<li>start from a different base, thus beginning a new stage of the build</li>
<li>pass artifacts from one stage to another</li>
</ul>
<p>I started this while looking for the best way to use both Docker and Poetry, and stumbled upon a quite complete Dockerfile at <a href="https://github.com/michaeloliverx/python-poetry-docker-example/blob/master/docker/Dockerfile">github.com/michaeloliverx/python-poetry-docker-example</a>. The author uses a multi-stage build to offer several images for development, testing, linting and production. Stopping at a specific image minimizes the image size and build time for each step. However, I was not completely happy with this example, which was a bit too complex for me, and I wanted to dig in anyway.</p>
<p>In our case we will have the following stages:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">staging</code>: installs Poetry and copies the relevant source files</li>
<li><code class="language-plaintext highlighter-rouge">development</code>: installs our project in editable mode</li>
<li><code class="language-plaintext highlighter-rouge">build</code>: builds our project into a wheel file</li>
<li><code class="language-plaintext highlighter-rouge">production</code>: a clean Python image that installs the built wheel</li>
</ul>
<h2 id="choosing-a-base-version">Choosing a base version</h2>
<p>To start off our image we need to choose the base image, with two obvious options:</p>
<ol>
<li>Official Linux images (Ubuntu, Debian, RHEL)</li>
<li>Official Python images</li>
</ol>
<p>I discarded the first one as you don’t always have the most recent Python versions, and I am not so worried about performance differences pointed out by <a href="https://pythonspeed.com/articles/base-image-python-docker-images/">Python=>Speed</a>.</p>
<p>Regarding the <a href="https://hub.docker.com/_/python">official Python images on docker.com</a>, we still have to choose the tag (i.e. flavor) we want:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;(-&lt;debian-codename&gt;)</code>: based on a specific (or the latest) Debian version, with common packages installed</li>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;-slim</code>: based on a specific (or the latest) Debian version, with only the strict requirements for a working Python environment</li>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;-alpine</code>: discarded because, although much smaller, it brings complexity specific to the Alpine distribution</li>
</ol>
<p>I chose option 1, the recommended default, which includes the build tools that are needed anyway.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:3.10.0 as python-base</span>
</code></pre></div></div>
<h2 id="stage-staging">Stage: Staging</h2>
<h3 id="arg-and-environment-variables">ARG and environment variables</h3>
<p>We set up a few environment variables for Python, Pip and Poetry configurations.</p>
<p>A few things to keep in mind:</p>
<ul>
<li>Dockerfile doesn’t support referencing previously defined environment variables within the same <code class="language-plaintext highlighter-rouge">ENV</code> instruction.</li>
<li><code class="language-plaintext highlighter-rouge">ARG</code>s defined before a stage can be used inside it by referencing them again in the stage.</li>
<li>When using a different base image than the previous stages, <code class="language-plaintext highlighter-rouge">ENV</code> variables won’t be defined anymore. (seems obvious once said…)</li>
</ul>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ARG</span><span class="s"> APP_NAME=coach_planner</span>
<span class="k">ARG</span><span class="s"> APP_PATH=/opt/$APP_NAME</span>
<span class="k">ARG</span><span class="s"> PYTHON_VERSION=3.10.0</span>
<span class="k">ARG</span><span class="s"> POETRY_VERSION=1.1.11</span>
</code></pre></div></div>
<ol>
<li>The Python process will run only once in the container, so we don’t need to write the compiled Python files (*.pyc) to disk</li>
<li>Make sure Python outputs are sent straight to the terminal</li>
<li>Make sure Python tracebacks are dumped (even on segfaults, for instance)</li>
<li>No need for the pip cache in the Docker image</li>
</ol>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ENV</span><span class="s"> \</span>
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1
<span class="k">ENV</span><span class="s"> \</span>
POETRY_VERSION=$POETRY_VERSION \
POETRY_HOME="/opt/poetry" \
POETRY_VIRTUALENVS_IN_PROJECT=true \
POETRY_NO_INTERACTION=1
</code></pre></div></div>
<ol>
<li>We won’t update the pip version in any case</li>
<li>The default timeout is only 15 seconds</li>
<li>Pin the Poetry version</li>
<li>Instead of /root</li>
<li>Make sure the <code class="language-plaintext highlighter-rouge">.venv</code> directory will be in the build directory</li>
<li>No prompts from Poetry</li>
<li>Paths for the building stages</li>
<li>Add the virtual environment to the path in a separate <code class="language-plaintext highlighter-rouge">ENV</code> line to be able to use previously defined environment variables.</li>
<li>Update the path with the Poetry and virtual env paths.</li>
</ol>
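<p>Only items 3 to 6 of this list appear in the snippet above. The remaining items could look like the following sketch (the pip values mirror those used in the production stage later in this post; the path variable names are illustrative):</p>

```dockerfile
# Items 1-2: pip settings
ENV \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100
# Items 7-9: paths for the building stages, then a PATH update in a
# separate ENV line so it can reference the variables defined above
ENV \
    APP_PATH=/opt/$APP_NAME \
    VENV_PATH=/opt/$APP_NAME/.venv
ENV PATH="$POETRY_HOME/bin:$VENV_PATH/bin:$PATH"
```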
<h3 id="install-poetry">Install Poetry</h3>
<p>The installation follows Poetry’s official documentation and makes use of the new install script supporting the upcoming Poetry version. We need to update our <code class="language-plaintext highlighter-rouge">PATH</code> to be able to use Poetry afterwards.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install Poetry - respects $POETRY_VERSION & $POETRY_HOME</span>
<span class="k">RUN </span>curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python
<span class="k">ENV</span><span class="s"> PATH="$POETRY_HOME/bin:$PATH"</span>
</code></pre></div></div>
<h3 id="source-file-and-dependencies">Source file and dependencies</h3>
<p>To copy our source files, we make the assumption that the directory structure follows the one poetry would create:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry-demo
├── pyproject.toml
├── poetry_demo
│ └── __init__.py
</code></pre></div></div>
<p>We also obviously copy the <code class="language-plaintext highlighter-rouge">poetry.lock</code> file.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">COPY</span><span class="s"> ./poetry.lock ./pyproject.toml ./</span>
<span class="k">COPY</span><span class="s"> ./$APP_NAME ./$APP_NAME</span>
</code></pre></div></div>
<h2 id="stage-development">Stage: Development</h2>
<p>Make sure to specify the <code class="language-plaintext highlighter-rouge">ARG</code> we need in this stage.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> staging as development</span>
<span class="k">ARG</span><span class="s"> APP_NAME</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
</code></pre></div></div>
<h3 id="install-our-project">Install our project</h3>
<p>Nothing fancy here: we make use of the <code class="language-plaintext highlighter-rouge">poetry install</code> command.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">RUN </span>poetry <span class="nb">install</span>
</code></pre></div></div>
<h3 id="flask-webserver-and-entrypoint">Flask webserver and entrypoint</h3>
<p>In development mode we use the default Flask webserver. We first define a few Flask-related environment variables, and then set up the entrypoint so that the <code class="language-plaintext highlighter-rouge">flask run</code> command runs in the activated virtual environment of the project. This has several advantages:</p>
<ul>
<li>no direct <code class="language-plaintext highlighter-rouge">path</code> manipulation</li>
<li>expresses the intended use of this stage</li>
<li>documents how to use this stage</li>
<li>easy command override without bothering with the virtual environment, e.g. <code class="language-plaintext highlighter-rouge">docker run -it poetry flask shell</code>.</li>
</ul>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ENV</span><span class="s"> FLASK_APP=$APP_NAME \</span>
FLASK_ENV=development \
FLASK_RUN_HOST=0.0.0.0 \
FLASK_RUN_PORT=8888
<span class="k">ENTRYPOINT</span><span class="s"> ["poetry", "run"]</span>
<span class="k">CMD</span><span class="s"> ["flask", "run"]</span>
</code></pre></div></div>
<p>We can still access a shell by overriding the entrypoint, but that shouldn’t be the most common use case imho. Note that in this case one should activate the environment.</p>
<h2 id="stage-build">Stage: Build</h2>
<p>We first use the <code class="language-plaintext highlighter-rouge">poetry build</code> command, and add the <code class="language-plaintext highlighter-rouge">--format wheel</code> parameter to only build the wheel.</p>
<p>Then we use <code class="language-plaintext highlighter-rouge">poetry export</code> to get a file containing the dependency version constraints for our future pip installation. We pass the <code class="language-plaintext highlighter-rouge">--without-hashes</code> flag, but this could be removed to take advantage of <code class="language-plaintext highlighter-rouge">pip install --require-hashes</code>.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> staging as build</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
<span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">RUN </span>poetry build <span class="nt">--format</span> wheel
<span class="k">RUN </span>poetry <span class="nb">export</span> <span class="nt">--format</span> requirements.txt <span class="nt">--output</span> constraints.txt <span class="nt">--without-hashes</span>
</code></pre></div></div>
<h2 id="stage-production">Stage: Production</h2>
<p>For our production, we will start from a clean python image, and install our freshly built application.</p>
<h3 id="environment-variables">Environment variables</h3>
<p>We redefine some Python-related environment variables (required since we start from a fresh image), and add some directly related to pip. Note that again we reference our <code class="language-plaintext highlighter-rouge">ARG</code>s to be able to use them in this stage.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:$PYTHON_VERSION as production</span>
<span class="k">ARG</span><span class="s"> APP_NAME</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
<span class="k">ENV</span><span class="s"> \</span>
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1
<span class="k">ENV</span><span class="s"> \</span>
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on \
PIP_DEFAULT_TIMEOUT=100
</code></pre></div></div>
<h3 id="installating-our-application">Installing our application</h3>
<p>We first retrieve the packaged application and the constraints file from the <code class="language-plaintext highlighter-rouge">build</code> stage using the <code class="language-plaintext highlighter-rouge">--from</code> flag of the <code class="language-plaintext highlighter-rouge">COPY</code> instruction. Then we proceed with the installation using pip.</p>
<p>Note that we make use of wildcards, but we could also pin the application version, similarly to the Python and Poetry versions, to get a more deterministic install.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">COPY</span><span class="s"> --from=build $APP_PATH/dist/*.whl ./</span>
<span class="k">COPY</span><span class="s"> --from=build $APP_PATH/constraints.txt ./</span>
<span class="k">RUN </span>pip <span class="nb">install</span> ./<span class="nv">$APP_NAME</span><span class="k">*</span>.whl <span class="nt">--constraint</span> constraints.txt
</code></pre></div></div>
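<p>A more deterministic variant could pin the application version with an extra <code class="language-plaintext highlighter-rouge">ARG</code>; this is only a sketch, and the <code class="language-plaintext highlighter-rouge">APP_VERSION</code> value is hypothetical:</p>

```dockerfile
ARG APP_VERSION=0.1.0
WORKDIR $APP_PATH
# Wheel filenames follow <name>-<version>-<tags>.whl, so pinning the
# version narrows the wildcard down to a single expected file
COPY --from=build $APP_PATH/dist/$APP_NAME-$APP_VERSION-*.whl ./
COPY --from=build $APP_PATH/constraints.txt ./
RUN pip install ./$APP_NAME-$APP_VERSION-*.whl --constraint constraints.txt
```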
<h3 id="entrypoint">Entrypoint</h3>
<p>We will use <code class="language-plaintext highlighter-rouge">gunicorn</code> to serve our application in production.</p>
<p>We define two environment variables that will be used in the <code class="language-plaintext highlighter-rouge">gunicorn</code> command, allowing them to be overridden (mostly the <code class="language-plaintext highlighter-rouge">PORT</code>). This can be used when deploying on GCP Cloud Run, for instance.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># gunicorn port. Naming is consistent with GCP Cloud Run</span>
<span class="k">ENV</span><span class="s"> PORT=8888 </span>
<span class="c"># export APP_NAME as environment variable for the CMD</span>
<span class="k">ENV</span><span class="s"> APP_NAME=$APP_NAME</span>
</code></pre></div></div>
<p>The difference between <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> and <code class="language-plaintext highlighter-rouge">CMD</code> can be confusing, especially when they are used together, and I would recommend reading the <a href="https://docs.docker.com/engine/reference/builder/#understand-how-cmd-and-entrypoint-interact">Docker documentation on the topic</a>. In our case we need shell variable substitution for the environment variables, which limits our choices. An alternative would be to use a <code class="language-plaintext highlighter-rouge">config.py</code> file.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">COPY</span><span class="s"> ./docker/docker-entrypoint.sh /docker-entrypoint.sh # 1</span>
<span class="k">RUN </span><span class="nb">chmod</span> +x /docker-entrypoint.sh <span class="c"># 1</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/docker-entrypoint.sh"] # 2</span>
<span class="k">CMD</span><span class="s"> ["gunicorn", "--bind :$PORT", "--workers 1", "--threads 1", "--timeout 0", "\"$APP_NAME:create_app()\""] # 3</span>
</code></pre></div></div>
<ol>
<li>Get the entrypoint script (see below) and make it executable</li>
<li>With this syntax, this is equivalent to <code class="language-plaintext highlighter-rouge">exec /docker-entrypoint.sh</code></li>
<li>These arguments will be passed to the entrypoint script and can be overridden by the arguments passed to <code class="language-plaintext highlighter-rouge">docker run</code></li>
</ol>
<p>And the script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span> <span class="c"># 1 </span>
<span class="nb">eval</span> <span class="s2">"exec </span><span class="nv">$@</span><span class="s2">"</span> <span class="c"># 2</span>
</code></pre></div></div>
<ol>
<li>Exit the script if any error occurs</li>
<li>Expand the arguments passed to the entrypoint script in the shell before passing them to <code class="language-plaintext highlighter-rouge">exec</code>, so that it supports passing environment variables.</li>
</ol>
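<p>The effect of that <code class="language-plaintext highlighter-rouge">eval</code>/<code class="language-plaintext highlighter-rouge">exec</code> combination can be reproduced outside Docker. In this sketch the inner <code class="language-plaintext highlighter-rouge">sh -c</code> plays the role of the entrypoint script and <code class="language-plaintext highlighter-rouge">echo</code> stands in for the gunicorn command (all names and values here are purely illustrative):</p>

```shell
#!/bin/sh
# Environment variable as provided by e.g. `docker run --env PORT=...`
PORT=5555
export PORT
# The CMD arguments reach the entrypoint as "$@" with a literal $PORT inside;
# `eval "exec $@"` expands $PORT in the shell before exec-ing the command
result=$(sh -c 'eval "exec $@"' entrypoint echo 'binding to :$PORT')
echo "$result"
```

Without the <code class="language-plaintext highlighter-rouge">eval</code>, the literal string <code class="language-plaintext highlighter-rouge">$PORT</code> would be passed through unexpanded.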
<h2 id="build-our-image-and-use-it">Build our image and use it!</h2>
<h3 id="production-image">Production image</h3>
<p>Let’s build our image and try using it.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker build <span class="nt">--tag</span> poetry <span class="nt">--file</span> docker/Dockerfile <span class="nb">.</span>
<span class="go">
[+] Building 33.3s (17/17) FINISHED
</span><span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build definition from Dockerfile 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring dockerfile: 2.13kB 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load .dockerignore 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 2B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load metadata <span class="k">for </span>docker.io/library/python:3.10.0 1.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build context 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 332B 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 1/5] FROM docker.io/library/python:3.10.0@sha256:bb797f045026352 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> resolve docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab376c5666 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>production 2/6] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 2/5] RUN curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poetry/poe 22.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 3/5] WORKDIR /opt/coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 4/5] COPY ./poetry.lock ./pyproject.toml ./ 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 5/5] COPY ./coach_planner ./coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>build 1/2] WORKDIR /opt/coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>build 2/2] RUN poetry build <span class="nt">--format</span> wheel 2.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 3/6] COPY <span class="nt">--from</span><span class="o">=</span>build /opt/coach_planner/dist/<span class="k">*</span>.whl ./ 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 4/6] RUN pip <span class="nb">install</span> ./coach_planner<span class="k">*</span>.whl 3.5s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 5/6] COPY ./docker/docker-entrypoint.sh /docker-entrypoint.sh 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 6/6] RUN <span class="nb">chmod</span> +x /docker-entrypoint.sh 0.6s
<span class="gp"> =></span><span class="w"> </span>exporting to image 0.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> exporting layers 0.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> writing image sha256:a612b9cdb91bacb1cefd1393318a5d15238034183fd48dbbeafffdd7 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> naming to docker.io/library/poetry 0.0s
</code></pre></div></div>
<p>Note that only the <code class="language-plaintext highlighter-rouge">staging</code>, <code class="language-plaintext highlighter-rouge">build</code>, and <code class="language-plaintext highlighter-rouge">production</code> stages appear here. Since the <code class="language-plaintext highlighter-rouge">development</code> stage is not required for the final (i.e. default) stage, it is not even built.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry
<span class="go">
[2021-11-15 18:17:59 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-11-15 18:17:59 +0000] [1] [INFO] Listening at: http://0.0.0.0:8888 (1)
[2021-11-15 18:17:59 +0000] [1] [INFO] Using worker: sync
[2021-11-15 18:17:59 +0000] [11] [INFO] Booting worker with pid: 11
</span></code></pre></div></div>
<p>Let’s change our gunicorn binding port:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">--env</span> <span class="nv">PORT</span><span class="o">=</span>5555 <span class="nt">-it</span> poetry
<span class="go">
[2021-11-17 18:21:28 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-11-17 18:21:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:5555 (1)
[2021-11-17 18:21:28 +0000] [1] [INFO] Using worker: sync
[2021-11-17 18:21:28 +0000] [7] [INFO] Booting worker with pid: 7
</span></code></pre></div></div>
<h3 id="development-image">Development image</h3>
<p>Now let’s use our development stage:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker build <span class="nt">--target</span> development <span class="nt">-t</span> poetry <span class="nt">--file</span> docker/Dockerfile <span class="nb">.</span>
<span class="go">
[+] Building 1.4s (12/12) FINISHED
</span><span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build definition from Dockerfile 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring dockerfile: 38B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load .dockerignore 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 2B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load metadata <span class="k">for </span>docker.io/library/python:3.10.0 1.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build context 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 260B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 1/5] FROM docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> resolve docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab376c5666 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 2/5] RUN curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poet 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 3/5] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 4/5] COPY ./poetry.lock ./pyproject.toml ./ 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 5/5] COPY ./coach_planner ./coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>development 1/2] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>development 2/2] RUN poetry <span class="nb">install </span>0.0s
<span class="gp"> =></span><span class="w"> </span>exporting to image 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> exporting layers 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> writing image sha256:cae979092d3b1e2c833e96f2e9acdfbd6980609f36d907a6547f05f1 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> naming to docker.io/library/poetry 0.0s
</code></pre></div></div>
<p>Here only the <code class="language-plaintext highlighter-rouge">staging</code> and <code class="language-plaintext highlighter-rouge">development</code> stages are run!</p>
<p>And using it:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry
<span class="go">
* Serving Flask app 'coach_planner' (lazy loading)
* Environment: development
* Debug mode: on
* Running on all addresses.
WARNING: This is a development server. Do not use it in a production deployment.
* Running on http://172.17.0.2:8888/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 774-185-805
</span></code></pre></div></div>
<p>Or overriding the default <code class="language-plaintext highlighter-rouge">CMD</code> and getting a Python shell:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry python
<span class="go">
Python 3.10.0 (default, Oct 26 2021, 22:20:53) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
</span><span class="gp">></span><span class="o">>></span>
</code></pre></div></div>
<p>Or even overriding the default <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> and then starting a poetry shell:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">--entrypoint</span> /bin/bash <span class="nt">-it</span> poetry
<span class="go">
</span><span class="gp">root@4dd4bb68f510:/opt/coach_planner#</span><span class="w"> </span>poetry shell
<span class="go">Spawning shell within /opt/coach_planner/.venv
</span><span class="gp">root@4dd4bb68f510:/opt/coach_planner#</span><span class="w"> </span><span class="nb">.</span> /opt/coach_planner/.venv/bin/activate
<span class="gp">(.venv) root@4dd4bb68f510:/opt/coach_planner#</span><span class="w">
</span></code></pre></div></div>Baptiste MaingretBuild a multi-stage Docker image from official Python images with support for Poetry projects. Source code on GithubReading notes for Architecture Patterns with Python by Harry Percival, Bob Gregory2021-08-13T00:00:00+02:002021-08-13T00:00:00+02:00https://bmaingret.github.io/blog/architecture-patterns-with-python<p>Notes, references and code I wrote while reading and coding along <a href="https://www.oreilly.com/library/view/architecture-patterns-with/9781492052197/"><code class="language-plaintext highlighter-rouge">Architecture Patterns with Python</code> by Harry Percival, Bob Gregory - O’Reilly</a>.</p>
<p><strong>Work in progress</strong></p>
<!--more-->
<p>My code-along repository: <a href="https://github.com/bmaingret/architecture-patterns-code-along">bmaingret/architecture-patterns-code-along</a>.</p>
<p>The full code by the authors is also available, as well as their book, at <a href="https://github.com/cosmicpython/">github.com/cosmicpython</a>.</p>
<ul>
<li><a href="#chapter-1---domain-model">Chapter 1 - Domain model</a>
<ul>
<li><a href="#diving-in-the-domain-model">Diving in the domain model</a></li>
<li><a href="#value-object-pattern">Value Object Pattern</a></li>
<li><a href="#domain-entity">Domain entity</a></li>
<li><a href="#not-everything-must-be-in-a-class">Not everything must be in a class</a></li>
<li><a href="#exceptions-as-domain-concepts">Exceptions as domain concepts</a></li>
</ul>
</li>
<li><a href="#chapter-2---repository-pattern">Chapter 2 - Repository Pattern</a>
<ul>
<li><a href="#repository-pattern">Repository pattern</a></li>
<li><a href="#port-and-adapter">Port and Adapter</a></li>
<li><a href="#orm">ORM</a></li>
</ul>
</li>
<li><a href="#interlude---reproducibility-and-continuous-integration">Interlude - Reproducibility and Continuous Integration</a>
<ul>
<li><a href="#development-and-production-environment---python-mess">Development and Production Environment - Python mess</a></li>
<li><a href="#github-actions">Github Actions</a></li>
<li><a href="#git-hooks---pre-commit">Git hooks - Pre-commit</a></li>
</ul>
</li>
<li><a href="#chapter-3---coupling-and-abstractions">Chapter 3 - Coupling and Abstractions</a></li>
<li><a href="#chapter-4---service-layer-pattern">Chapter 4 - Service Layer pattern</a></li>
<li><a href="#chapter-5---tdd-in-high-gear-and-low-gear">Chapter 5 - TDD in High Gear and Low Gear</a></li>
<li><a href="#chapter-6---unit-of-work">Chapter 6 - Unit Of Work</a></li>
<li><a href="#chapter-7---aggregate-and-consistency-boundaries">Chapter 7 - Aggregate and Consistency Boundaries</a>
<ul>
<li><a href="#handling-concurrency">Handling concurrency</a></li>
</ul>
</li>
<li><a href="#chapter-8---events-and-the-message-bus">Chapter 8 - Events and the Message Bus</a></li>
</ul>
<h2 id="chapter-1---domain-model">Chapter 1 - Domain model</h2>
<blockquote>
<p>The domain is a fancy way of saying the problem you’re trying to solve</p>
</blockquote>
<blockquote>
<p>The domain model is the mental map that business owners have of their businesses</p>
</blockquote>
<h3 id="diving-in-the-domain-model">Diving in the domain model</h3>
<ul>
<li>Understand the business jargon and keeps a glossary</li>
<li>Get concrete examples of the rules defining the domain model</li>
<li>TDD: translates those rules into unit tests</li>
</ul>
<h3 id="value-object-pattern">Value Object Pattern</h3>
<blockquote>
<p>any domain object that is uniquely identified by the data it holds; we usually make them immutable</p>
</blockquote>
<p>In Python this can easily be translated to a frozen dataclass, offering hashing for free.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">OrderLine</span><span class="p">:</span>
<span class="n">order_reference</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">sku</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">quantity</span><span class="p">:</span> <span class="nb">int</span>
</code></pre></div></div>
<p>Note however that because SQLAlchemy modifies the class at runtime, we have to use <code class="language-plaintext highlighter-rouge">unsafe_hash=True</code> instead…</p>
<h3 id="domain-entity">Domain entity</h3>
<blockquote>
<p>Domain object that has long-lived identity</p>
</blockquote>
<p>Entities are usually mutable and have a fixed identity that does not depend on their values.</p>
<blockquote>
<p>We usually make this explicit in code by implementing equality operators on entities:</p>
</blockquote>
<p>In Python that means defining the <code class="language-plaintext highlighter-rouge">__eq__</code> operator.</p>
<p>Be careful when defining hash: it basically means identifying what uniquely defines an entity throughout its life.</p>
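<p>A minimal sketch of such an entity (the <code class="language-plaintext highlighter-rouge">Batch</code> name follows the book’s running example; the fields here are illustrative):</p>

```python
class Batch:
    """Domain entity: identified by its reference, not by its values."""

    def __init__(self, reference: str, sku: str, quantity: int):
        self.reference = reference
        self.sku = sku
        self.quantity = quantity  # mutable state, irrelevant to identity

    def __eq__(self, other):
        if not isinstance(other, Batch):
            return False
        return self.reference == other.reference

    def __hash__(self):
        # Hash only on the immutable identity, never on mutable state
        return hash(self.reference)
```

Two batches with the same reference compare equal even when their mutable state differs.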
<h3 id="not-everything-must-be-in-a-class">Not everything must be in a class</h3>
<p>Domain Service Function: in Python we can simply put the function at module level, without making it more than that.</p>
<h3 id="exceptions-as-domain-concepts">Exceptions as domain concepts</h3>
<p>Business errors can be nicely represented by exceptions</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OutOfStock</span><span class="p">(</span><span class="nb">Exception</span><span class="p">):</span>
<span class="k">pass</span>
</code></pre></div></div>
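<p>A module-level domain service function can then raise such an exception; this sketch is purely illustrative (the book’s actual <code class="language-plaintext highlighter-rouge">allocate</code> signature differs):</p>

```python
class OutOfStock(Exception):
    pass


def allocate(quantity: int, stock: int) -> int:
    """Module-level domain service function: no class wrapper needed."""
    if quantity > stock:
        # The business rule "cannot allocate more than available"
        # surfaces as a domain exception
        raise OutOfStock(f"Cannot allocate {quantity} units, only {stock} in stock")
    return stock - quantity
```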
<h2 id="chapter-2---repository-pattern">Chapter 2 - Repository Pattern</h2>
<h3 id="repository-pattern">Repository pattern</h3>
<p>The Repository pattern makes an abstraction around the storage: everything looks like it is stored in-memory, which decouples the domain model from the details of the persistence layer.</p>
<h3 id="port-and-adapter">Port and Adapter</h3>
<p>Port usually is some interface, and adapter its implementation. In Python, this usually translates to some abstract base class and its implementation, but it can also be an implicit duck type port.</p>
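<p>A sketch of that translation (the <code class="language-plaintext highlighter-rouge">AbstractRepository</code> naming follows the book’s conventions; the dict-backed fake is illustrative):</p>

```python
import abc


class AbstractRepository(abc.ABC):
    """The port: an interface the rest of the code depends on."""

    @abc.abstractmethod
    def add(self, obj):
        raise NotImplementedError

    @abc.abstractmethod
    def get(self, reference):
        raise NotImplementedError


class FakeRepository(AbstractRepository):
    """An adapter: a dict-backed implementation, handy for tests."""

    def __init__(self):
        self._store = {}

    def add(self, obj):
        self._store[obj.reference] = obj

    def get(self, reference):
        return self._store[reference]
```

A real adapter backed by SQLAlchemy would implement the same two methods against a database session.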
<h3 id="orm">ORM</h3>
<p>An ORM can lead to a strong dependency on the ORM framework, and one must be careful to invert the dependency and make the ORM depend on the domain models instead.</p>
<h2 id="interlude---reproducibility-and-continuous-integration">Interlude - Reproducibility and Continuous Integration</h2>
<p>Although I had read a lot and knew this was the way to go, I never took the time to implement it in any of my projects, so I thought this would be a good opportunity, even more so since the original authors take this path as well (Makefile and Docker).</p>
<h3 id="development-and-production-environment---python-mess">Development and Production Environment - Python mess</h3>
<blockquote>
<p>I mostly followed <a href="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry">https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry</a></p>
<p>I had so much trouble making everything work under plain Windows, that I moved all my Python dev to WSL2.</p>
</blockquote>
<p><img src="https://imgs.xkcd.com/comics/python_environment.png" alt="Had to" /></p>
<blockquote>
<p>https://xkcd.com/1987</p>
</blockquote>
<p>I had learned of Poetry a year ago, but had still stuck with Conda. I never liked the Conda way of handling dependencies, but it is still of great help to install some data science tools that are not pure Python.</p>
<ul>
<li>Install Conda (I prefer to use the <a href="https://docs.conda.io/en/latest/miniconda.html">Miniconda installer</a>)</li>
<li>Install Poetry (through the <a href="https://python-poetry.org/docs/master/#installation">install-poetry.py script</a>). N.B. this requires a working installation of Python, so if you only installed it with Conda, install it through a shell with a Conda environment active.</li>
<li>Create a minimal Conda environment (<code class="language-plaintext highlighter-rouge">conda create --name remote-work-env python=3.8.5</code> or <code class="language-plaintext highlighter-rouge">conda create --file environment.yml</code>)</li>
<li>Set up a new project/Init an existing project using Poetry (e.g. <code class="language-plaintext highlighter-rouge">poetry init</code> from within an existing directory). N.B. Poetry should pick up the active Conda environment and not create a new one.</li>
<li>Manage dependencies with Poetry (<code class="language-plaintext highlighter-rouge">poetry add sqlalchemy</code>, <code class="language-plaintext highlighter-rouge">poetry add --dev pytest</code>)</li>
</ul>
<h3 id="github-actions">Github Actions</h3>
<p>GitHub Actions are a recent (and welcome) addition to GitHub, allowing CI/CD workflows right in GitHub. Although it is possible to stay in the GitHub ecosystem, some <em>Actions</em> still depend on external tools and APIs.</p>
<p>To get a good starting example, go to the Actions tab of a GitHub repository and, at the <em>Choose a workflow template</em> step, select <em>Skip this and set up a workflow yourself</em>; an editable configuration template will open right in the browser.</p>
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_github_actions.png" alt="" /></p>
<p>For this repository, I have a single workflow covering test runs, test coverage, linting and formatting. The Python setup is done thanks to an available GitHub Action, dependency installation thanks to Poetry, coverage through pytest-cov and <a href="https://app.codecov.io/gh/bmaingret/architecture-patterns-code-along">codecov</a>, and finally linting/formatting thanks to <a href="https://results.pre-commit.ci/repo/github/395353648">pre-commit.ci</a>. More on that in the next part on pre-commit.</p>
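<p>As an indicative sketch only (the action versions and step names are my assumptions, not copied from the actual repository), such a workflow could look like this:</p>

```yaml
# .github/workflows/ci.yml -- indicative sketch, not the repository's actual file
name: CI
on: [push, pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install poetry && poetry install
      - run: poetry run pytest --cov --cov-report=xml
      - uses: codecov/codecov-action@v2
```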
<h3 id="git-hooks---pre-commit">Git hooks - Pre-commit</h3>
<p>Instead of (or in addition to) automating code quality checks before merging, we can take advantage of git hooks to run these checks before even committing. Thanks to some good people, we have an awesome Python tool for that: <a href="https://pre-commit.com">pre-commit.com</a>. Once installed, configuration is done through a <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> file, and the hooks are installed with <code class="language-plaintext highlighter-rouge">pre-commit install</code>. Once <em>installed</em>, you can see the generated Python script in <code class="language-plaintext highlighter-rouge">.git/hooks/pre-commit</code>.</p>
<p>To configure hooks with pre-commit, you need to specify git repositories. Several hooks are available at the <a href="https://github.com/pre-commit/pre-commit-hooks">pre-commit repository</a>, among which I configured:</p>
<ul>
<li>check-yaml</li>
<li>check-toml</li>
<li>end-of-file-fixer</li>
<li>trailing-whitespace</li>
</ul>
<p>To this I added <a href="https://github.com/PyCQA/flake8">Flake8</a>, a Python linter, and <a href="https://github.com/psf/black">black</a>, a code formatter (this one will modify your code; you can configure it to only check, but that would defeat the point).</p>
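<p>A configuration matching the hooks above could look roughly like this (the <code class="language-plaintext highlighter-rouge">rev</code> pins are indicative; <code class="language-plaintext highlighter-rouge">pre-commit autoupdate</code> will set them for you):</p>

```yaml
# .pre-commit-config.yaml -- sketch covering the hooks discussed above
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    hooks:
      - id: check-yaml
      - id: check-toml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
  - repo: https://github.com/psf/black
    rev: 22.1.0
    hooks:
      - id: black
```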
<p>Note that thanks to <a href="https://pre-commit.ci">pre-commit.ci</a>, the same configuration file can be used both to install the hooks locally, and to run checks in a Github Action.</p>
<h2 id="chapter-3---coupling-and-abstractions">Chapter 3 - Coupling and Abstractions</h2>
<blockquote>
<p>reduce the degree of coupling within a system by abstracting away the details</p>
</blockquote>
<p>Some key takeaways:</p>
<ul>
<li>Abstractions and decoupling help for testing (c.f. the repository pattern).</li>
<li>Separate the core logic code from external states.</li>
</ul>
<p>This usually allows <em>edge-to-edge</em> testing, faking some details (quite often I/O). It requires some additional abstractions (around the filesystem, for instance) and new explicit dependencies on these abstractions.</p>
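<p>A minimal sketch of the idea (the function names are mine, not the book's): the core logic takes the I/O as an explicit, injectable dependency, so an edge-to-edge test can swap in a fake and never touch the disk.</p>

```python
import os

def count_large_files(list_sizes, threshold=1024):
    """Core logic: a pure function over an injected size-listing callable."""
    return sum(1 for size in list_sizes() if size > threshold)

def real_list_sizes(directory="."):
    """The I/O detail, kept behind the same callable interface."""
    return [
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    ]

# Edge-to-edge test: fake the filesystem with a plain callable
fake_sizes = lambda: [100, 2048, 4096]
assert count_large_files(fake_sizes) == 2
```

<p>In production code you would pass <code class="language-plaintext highlighter-rouge">real_list_sizes</code>; the test only depends on the abstraction.</p>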
<h2 id="chapter-4---service-layer-pattern">Chapter 4 - Service Layer pattern</h2>
<blockquote>
<p>Also called an <em>orchestration layer</em> or a <em>use-case layer</em>.</p>
</blockquote>
<ul>
<li>Service layer exposes the domain service functionalities through endpoints to the external world.</li>
<li>It wraps the boring stuff such as validating input, calling the domain model and updating it, and finally persisting everything</li>
<li>Interacting with our domain model becomes easier and allows for different types of interactions (CLI, web, etc.)</li>
<li>It eases the high-level and end-to-end tests, allowing for fewer tests and easier refactoring of the underlying domain models</li>
</ul>
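<p>A sketch of such a service function (a toy domain of my own, not the book's allocation model): it validates the input, orchestrates the domain call, and persists through a repository.</p>

```python
class InvalidSku(Exception):
    pass

def allocate(stock, sku, qty):
    """Toy domain logic: decrement the stock for a SKU."""
    stock[sku] -= qty
    return stock[sku]

def allocate_service(sku, qty, repo):
    """Service-layer function: validate, call the domain, persist."""
    stock = repo.get()
    if sku not in stock:                   # validate the entry
        raise InvalidSku(f"Invalid sku {sku}")
    remaining = allocate(stock, sku, qty)  # call and update the domain model
    repo.save(stock)                       # persist the result
    return remaining

class FakeRepository:
    """In-memory stand-in for a real repository; also handy in tests."""
    def __init__(self, stock):
        self._stock = stock
    def get(self):
        return dict(self._stock)
    def save(self, stock):
        self._stock = stock

repo = FakeRepository({"LAMP": 10})
assert allocate_service("LAMP", 3, repo) == 7
```

<p>A CLI, a Flask view or a test can all call <code class="language-plaintext highlighter-rouge">allocate_service</code> the same way; only the repository changes.</p>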
<p>Although the concept of a service layer is interesting, this chapter leaves things in a dubious state: there is still a lot of coupling towards the ORM from both the Flask app and the services, the low-level details of the domain implementation are everywhere, and tests are getting harder and harder to initialize properly.</p>
<h2 id="chapter-5---tdd-in-high-gear-and-low-gear">Chapter 5 - TDD in High Gear and Low Gear</h2>
<p>An analogy is made with biking, where you start in a low gear (unit tests) and then shift towards higher gears (e2e tests). This lets you hide implementation details further and have tests that are less coupled to those details.</p>
<p>To reduce coupling with domain models:</p>
<ul>
<li>Fixture functions to help initialize domain models</li>
<li>Adding services that will handle the domain models</li>
</ul>
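<p>The first bullet can be as simple as a factory helper (a hypothetical sketch; with pytest these would typically become <code class="language-plaintext highlighter-rouge">@pytest.fixture</code> functions): tests build domain objects through one helper, so a change to a model's constructor only touches one place.</p>

```python
def make_batch_and_line(sku, batch_qty, line_qty):
    """Fixture-style factory: tests never call model constructors directly,
    so a constructor change only touches this helper."""
    batch = {"sku": sku, "available": batch_qty}
    line = {"sku": sku, "qty": line_qty}
    return batch, line

def can_allocate(batch, line):
    """Toy domain rule used by the test below."""
    return batch["sku"] == line["sku"] and batch["available"] >= line["qty"]

batch, line = make_batch_and_line("CHAIR", 20, 2)
assert can_allocate(batch, line)
```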
<p>I’ll just copy/paste from the book here for rules of thumb regarding tests to implement:</p>
<blockquote>
<ul>
<li>
<p>Aim for one end-to-end test per feature</p>
</li>
<li>
<p>Write the bulk of your tests against the service layer (edge-to-edge)</p>
</li>
<li>
<p>Maintain a small core of tests written against your domain model (maintain is the important word here: start with a lot and delete once they are covered by services)</p>
</li>
<li>
<p>Error handling counts as a feature</p>
</li>
<li>
<p>Express your service layer in terms of primitives rather than domain objects.</p>
</li>
</ul>
</blockquote>
<h2 id="chapter-6---unit-of-work">Chapter 6 - Unit Of Work</h2>
<p>Services and the API are still tightly coupled with data persistence and session management. The Unit of Work defines a single entry point for data storage, allowing transactions to be handled cleanly (commits, rollbacks, failures, etc.). It also eases the integration between the service and repository layers.</p>
<p>In Python, it maps very well onto the context manager protocol.</p>
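<p>A bare-bones sketch of a Unit of Work as a context manager (illustrative and in-memory; the book's version wraps a database session): leaving the <code class="language-plaintext highlighter-rouge">with</code> block on an exception rolls back, and nothing is persisted without an explicit <code class="language-plaintext highlighter-rouge">commit()</code>.</p>

```python
class FakeUnitOfWork:
    """Minimal Unit of Work as a context manager (illustrative names)."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.rollback()   # any exception inside the block triggers a rollback
        return False          # don't swallow the exception

    def commit(self):
        self.committed = True

    def rollback(self):
        self.rolled_back = True

uow = FakeUnitOfWork()
with uow:
    uow.commit()              # explicit commit: nothing persists by default
assert uow.committed and not uow.rolled_back
```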
<h2 id="chapter-7---aggregate-and-consistency-boundaries">Chapter 7 - Aggregate and Consistency Boundaries</h2>
<blockquote>
<p>Invariants are conditions that are always true.</p>
</blockquote>
<blockquote>
<p>Constraints are rules that restrict the possible states of the model</p>
</blockquote>
<p>In order to ensure invariants and constraints, in addition to the logic behind them, we need to ensure data integrity, especially under concurrent operations. While we could lock the entire table/database we are manipulating, this won’t scale. The <strong>aggregate pattern</strong> groups several domain objects into a container and allows manipulating them as a single entity, thus ensuring the data integrity and consistency of everything in it (the actual implementation ensures this, not the mere use of the pattern).</p>
<p>The choice of aggregates is not simple and depends on the constraints of each project. Keep in mind that the less data there is in the domain models, the easier it is to ensure invariants and constraints.</p>
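<p>To make the idea concrete, a drastically simplified sketch (the names loosely echo the book's product/batch domain, but the structures here are mine): the aggregate is the single entry point to its internal objects, so the invariant lives in one place.</p>

```python
class Product:
    """Aggregate sketch: all access to batches goes through the Product,
    so the invariant "allocate to the earliest batch with enough stock"
    is enforced in a single place."""

    def __init__(self, sku, batches):
        self.sku = sku
        self.batches = batches       # internal objects, never touched directly
        self.version_number = 0      # handy later for optimistic concurrency

    def allocate(self, qty):
        batch = next(
            b for b in sorted(self.batches, key=lambda b: b["eta"])
            if b["available"] >= qty
        )
        batch["available"] -= qty
        self.version_number += 1
        return batch["ref"]

product = Product(
    "LAMP",
    [
        {"ref": "shipment", "eta": 2, "available": 10},
        {"ref": "in-stock", "eta": 1, "available": 5},
    ],
)
assert product.allocate(8) == "shipment"   # the in-stock batch is too small
assert product.version_number == 1
```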
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_recap_part1.png" alt="From Cosmic Python" /></p>
<blockquote>
<p><a href="https://github.com/cosmicpython/book/raw/master/images/apwp_0705.png">Direct link @Cosmic Python</a></p>
</blockquote>
<h3 id="handling-concurrency">Handling concurrency</h3>
<p><strong>Optimistic concurrency</strong>: assume things work fine most of the time</p>
<ul>
<li>Locking <em>things</em> at the db level usually comes with a performance cost -> usually used in a <em>pessimistic concurrency</em> mindset</li>
<li>Using version numbers to control updates and to be able to detect and recover from concurrent ones</li>
</ul>
<p>Note: there are a lot of db-specific ways of implementing locking at different levels (consistent reads, select for update, etc.)</p>
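<p>The version-number idea can be sketched with a plain in-memory store (illustrative only; a real implementation performs this compare-and-bump inside a database transaction):</p>

```python
class ConcurrentUpdate(Exception):
    pass

def update_if_unchanged(store, key, expected_version, new_value):
    """Optimistic concurrency sketch: write only if nobody bumped the
    version since we read it; otherwise the caller retries."""
    current_version, _ = store[key]
    if current_version != expected_version:
        raise ConcurrentUpdate(f"{key} was modified concurrently")
    store[key] = (current_version + 1, new_value)

store = {"LAMP": (3, 10)}           # version 3, stock 10
update_if_unchanged(store, "LAMP", expected_version=3, new_value=7)
assert store["LAMP"] == (4, 7)

try:                                # a stale writer sees a version conflict
    update_if_unchanged(store, "LAMP", expected_version=3, new_value=9)
except ConcurrentUpdate:
    pass
```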
<h2 id="chapter-8---events-and-the-message-bus">Chapter 8 - Events and the Message Bus</h2>
<p><em>Events</em> help enforce the Single Responsibility Principle (the <em>S</em> of <em>SOLID</em>), preventing multiple use cases from being tangled in a single place.</p>
<p>The <em>message bus</em> routes event messages to their different handlers. Typical middleware.</p>
<p>Events can be raised and handled at different places:</p>
<ul>
<li>The service layer takes events raised by the models and sends them straight to the message bus</li>
<li>The service layer raises events directly on the message bus</li>
<li>The UoW collects events from aggregates and sends them to the message bus</li>
</ul>
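<p>A minimal message bus can be a mapping from event type to handlers (an illustrative sketch of the pattern, not the book's exact implementation):</p>

```python
class OutOfStock:
    """A plain event: just data describing something that happened."""
    def __init__(self, sku):
        self.sku = sku

notifications = []

def notify_purchasing(event):
    notifications.append(f"Out of stock: {event.sku}")

# The bus routes each event type to its list of handlers
HANDLERS = {OutOfStock: [notify_purchasing]}

def handle(event):
    for handler in HANDLERS[type(event)]:
        handler(event)

handle(OutOfStock("LAMP"))
assert notifications == ["Out of stock: LAMP"]
```

<p>Adding a use case then means registering one more handler, not editing existing code.</p>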
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_events.png" alt="" /></p>
<blockquote>
<p><a href="https://github.com/cosmicpython/book/blob/master/images/apwp_0801.png">Direct link @CosmicPython</a></p>
</blockquote>
<h2 id="chapter-9---going-to-town-on-the-message-bus">Chapter 9 - Going to Town on the Message Bus</h2>
<blockquote>
<p>Using the message bus as an entry point for the service layer.</p>
</blockquote>
<ul>
<li>allows us to be granular and stick to the SRP</li>
<li>allows us to write tests in terms of events.</li>
</ul>
<p>Service functions become event handlers, and as such all internal and external actions are managed the same way: through event handlers.</p>
<p>When large changes are incoming:</p>
<blockquote>
<p>follow the Preparatory Refactoring workflow, aka “Make the change easy; then make the easy change”</p>
</blockquote>Baptiste MaingretNotes, references and codes I wrote while reading and coding along Architecture Patterns with Python by Harry Percival, Bob Gregory - O’Reilly. Work in progressReading session #32021-04-21T00:00:00+02:002021-04-21T00:00:00+02:00https://bmaingret.github.io/blog/reading-session-3<h2 id="articles">Articles</h2>
<ul>
<li><a href="#python-in-visual-studio-code--april-2021-release">Python in Visual Studio Code – April 2021 Release</a></li>
<li><a href="#is-manual-etl-better-than-no-code-etl-are-etl-tools-dead">Is manual ETL better than No-Code ETL: Are ETL tools dead?</a></li>
<li><a href="#the-explosion-of-roles-in-data-science">The Explosion of Roles in Data Science</a></li>
<li><a href="#the-sexiest-job-of-the-21st-century-isnt-sexy-anymore">The Sexiest Job of the 21st Century Isn’t “Sexy” Anymore</a></li>
<li><a href="#data-scientist-vs-machine-learning-engineer-skills-heres-the-difference">Data Scientist vs Machine Learning Engineer Skills. Here’s the Difference.</a></li>
<li><a href="#10-tips-and-tricks-for-data-scientists-vol4">10 Tips and Tricks for Data Scientists Vol.4</a></li>
</ul>
<!-- -->
<h2 id="python-in-visual-studio-code--april-2021-release">Python in Visual Studio Code – April 2021 Release</h2>
<p>Source: <a href="https://devblogs.microsoft.com/python/python-in-visual-studio-code-april-2021-release/">devblogs.microsoft.com</a></p>
<p>I have been using VS Code for 2 years now, mostly for Python, including notebooks, but also for any text file.</p>
<h3 id="support-for-poetry-environments">Support for <a href="https://python-poetry.org">Poetry</a> environments</h3>
<p>Never used it, but it will probably be the next thing I try if I ever move away from conda. It ditches the setup.cfg and setup.py files in favor of the PEP 518 pyproject.toml file. It also lets you distinguish between what you explicitly wanted to install and all the dependencies that got installed along the way!</p>
<h3 id="better-auto-completions-for-pytorch-using-pylance">Better auto-completions for Pytorch using Pylance</h3>
<p>Never had the opportunity to face these issues since so far I have used only TensorFlow and it was in Jupyter notebook where I have found auto-completion to be a bit clumsy.</p>
<h3 id="data-viewer-enhancements">Data Viewer Enhancements</h3>
<p>The data viewer is one of the reasons to ditch Jupyter notebooks and use VS Code. Since I had spent some time on RStudio before moving to Python for data science, this data viewer was one of the features I was missing.</p>
<p>Enhancements listed:</p>
<ul>
<li>Ability to refresh the data viewer</li>
<li>Support for PyTorch and TensorFlow Tensor data types</li>
<li>Visual update</li>
<li>Ability to slice data (huge!): easily see specific dimensions of high-dimensional data</li>
</ul>
<h2 id="is-manual-etl-better-than-no-code-etl-are-etl-tools-dead">Is manual ETL better than No-Code ETL: Are ETL tools dead?</h2>
<p>Source: <a href="https://www.analyticsvidhya.com/blog/2021/04/is-manual-etl-better-than-no-code-etl-are-etl-tools-dead/">analyticsvidhya.com</a></p>
<p>Tries to pit GUI tools against pure-code tools for ETL purposes, which I think doesn’t make much sense in most cases. Most GUI tools give you a way to script some parts, for custom-tailored transformations for instance. As for pure code for ETL, it can quickly become a challenging and human-resource-intensive process. In the end it usually comes down to the knowledge and know-how of the teams at work.</p>
<h2 id="the-explosion-of-roles-in-data-science">The Explosion of Roles in Data Science</h2>
<p>Source: <a href="https://towardsdatascience.com/the-explosion-of-roles-in-data-science-5963aa83e1c">towardsdatascience.com</a></p>
<blockquote>
<p>We have data scientists, data analysts, data engineers, machine learning engineers, analytics engineers, business intelligence engineers, data architects, data storytellers…</p>
</blockquote>
<p>How to overcome the overwhelming effect of so many different roles to choose from? Especially when job offers usually mix everything together and you have little or no prior experience in any of these roles?</p>
<ul>
<li>“You are not your role”: don’t limit yourself to the job title; roles overlap more often than not, and you are not tied to a job name</li>
<li>“Focus on abilities rather than on roles”: as said above, it is often unclear what exactly lies under each role, but abilities stay the same. Some companies might think a Data Engineer is a Database Admin. So? Don’t focus on the role but on the abilities you’ll develop: manipulating databases and SQL.</li>
<li>“Keep learning, keep improving”: some companies focus too much on what people know at the instant they want to recruit them. Considering the pace at which Data Science is evolving, I think it is fair to accept people who have the fundamental abilities required for DS/ML and who keep on learning.</li>
</ul>
<h2 id="the-sexiest-job-of-the-21st-century-isnt-sexy-anymore">The Sexiest Job of the 21st Century Isn’t “Sexy” Anymore</h2>
<p>Source: <a href="https://medium.com/illumination/the-sexiest-job-of-the-21st-century-isnt-sexy-anymore-fd5335a5d4d4">medium.com/illumination</a></p>
<ul>
<li>“#1 People Doesn’t Know What Actually Is Data Science”: be it people wanting to get into DS or people trying to recruit data scientists.</li>
<li>“#2 Expectation vs. Reality — Here Lies A Wide, Wide Gap!”: People getting into DS think they’ll work each week on a new project with cool new tech or algorithm…</li>
<li>“#3 Lack of Upskilling for Data Science Professionals”: things are moving so fast it can get hard to be an expert in anything, which is what companies are looking for</li>
<li>“#5 People Aren’t Willing To Wait”: It is a long road to become a proficient data scientist…</li>
</ul>
<h2 id="data-scientist-vs-machine-learning-engineer-skills-heres-the-difference">Data Scientist vs Machine Learning Engineer Skills. Here’s the Difference.</h2>
<p>Source: <a href="https://towardsdatascience.com/data-scientist-vs-machine-learning-engineer-skills-heres-the-difference-93eb2f4f6f98">towardsdatascience.com</a></p>
<p>This article seems just wrong to me:</p>
<ul>
<li>“a machine learning engineer does not necessarily need to know how random forest works, but they need to know how to save and load a file automatically”.</li>
<li>“If you can master these three base skills, you will be well on your way to being a great data scientist”. Skills being Python, Jupyter and SQL. If that’s all you require from a data scientist, see just above.</li>
</ul>
<h2 id="10-tips-and-tricks-for-data-scientists-vol4">10 Tips and Tricks for Data Scientists Vol.4</h2>
<p>Source: <a href="https://www.r-bloggers.com/2021/04/10-tips-and-tricks-for-data-scientists-vol-4/">r-bloggers.com</a></p>
<ul>
<li>You can get Google Drive data directly into Google Colab</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from google.colab import drive
drive.mount('/content/gdrive')
</code></pre></div></div>
<ul>
<li>Reading/Writing Pandas DF directly as GZip (use <code class="language-plaintext highlighter-rouge">compression='gzip'</code>)</li>
</ul>
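<p>The second tip in a quick sketch (the file name is arbitrary): pandas writes and reads gzip-compressed CSV directly, and would even infer the compression from a <code class="language-plaintext highlighter-rouge">.gz</code> extension if left unspecified.</p>

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write and read back a gzip-compressed CSV
df.to_csv("data.csv.gz", index=False, compression="gzip")
roundtrip = pd.read_csv("data.csv.gz", compression="gzip")
assert roundtrip.equals(df)
```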
<p>Others can be checked directly but were not of much interest for me.</p>Baptiste MaingretReading session #22021-03-21T00:00:00+01:002021-03-21T00:00:00+01:00https://bmaingret.github.io/blog/reading-session-2<h2 id="articles">Articles</h2>
<ul>
<li><a href="#bringing-machine-learning-models-into-production-without-effort-at-dailymotion">Bringing Machine Learning models into production without effort at Dailymotion</a></li>
<li><a href="#how-to-run-a-python-script-using-a-docker-container">How To Run a Python Script Using a Docker Container</a></li>
<li><a href="#how-to-build-a-dag-factory-on-airflow">How to build a DAG Factory on Airflow</a></li>
<li><a href="#deploying-machine-learning-into-production-dont-do-labs">Deploying Machine Learning into Production: Don’t do Labs.</a></li>
<li><a href="#mlops-is-changing-how-machine-learning-models-are-developed">MLOps Is Changing How Machine Learning Models Are Developed</a></li>
</ul>
<!-- -->
<h2 id="bringing-machine-learning-models-into-production-without-effort-at-dailymotion">Bringing Machine Learning models into production without effort at Dailymotion</h2>
<p>Source: <a href="https://medium.com/dailymotion/bring-machine-learning-models-faster-to-production-with-airflow-and-kubernetes-e9d47ca3bee5">medium.com/dailymotion</a></p>
<blockquote>
<p>How we manage to schedule Machine Learning pipelines seamlessly with Airflow and Kubernetes using KubernetesPodOperator</p>
</blockquote>
<p><img src="/assets/2021-00-00-reading-sessions/dailymotion_bring-machine-learning-models.png" alt="Data Scientists versus Data Engineers" /></p>
<p><em>Life cycle of a machine learning model - Dailymotion (c)</em></p>
<p>They followed what I would call a classic evolution:</p>
<ul>
<li>First a containerized approach on an always-up VM-like instance with a simple cron-like schedule</li>
<li>Then on-demand VM-like instantiation to run the container</li>
<li>Finally, node pools of different types allowing on-demand containers to run and share resources</li>
</ul>
<p>All of this thanks to the good integration between Airflow and Kubernetes (c.f. KubernetesPodOperator).</p>
<p>In the article they mention <a href="https://www.kubeflow.org/docs/about/kubeflow/">Kubeflow</a>, an open source project started at Google and designed specifically for Machine Learning workflows on Kubernetes.</p>
<h2 id="how-to-run-a-python-script-using-a-docker-container">How To Run a Python Script Using a Docker Container</h2>
<p>Source: <a href="https://towardsdatascience.com/how-to-mount-a-directory-inside-a-docker-container-4cee379c298b">towardsdatascience.com</a></p>
<p>A very simple introduction to setting up a Docker image with the required tools and software. Although I find Docker very attractive for reusability, it does require you to declare everything explicitly, which in my typical conda-based working environment can be a pain if I don’t want to add too much useless overhead.</p>
<h2 id="how-to-build-a-dag-factory-on-airflow">How to build a DAG Factory on Airflow</h2>
<p>Source: <a href="https://towardsdatascience.com/how-to-build-a-dag-factory-on-airflow-9a19ab84084c">towardsdatascience.com</a></p>
<p>I ran into similar concerns when trying out Jenkins at work, and there were very few examples of how to set things up properly. In addition, since I am more of a configuration-over-code type of guy, I wanted to take advantage of the recently introduced Jenkins Pipelines. Trying to set up the two while being the only knowledgeable person on the topic seemed like too much work and an unnecessary SPOF for our needs.</p>
<p>In this case the final result is appealing, but it still seems a bit of a hack (it assumes Python files only and relies on a hardcoded heuristic DAG-detection rule), and I’d rather have a tool such as Airflow provide a common interface for this.</p>
<h2 id="deploying-machine-learning-into-production-dont-do-labs">Deploying Machine Learning into Production: Don’t do Labs.</h2>
<p>Source: <a href="https://towardsdatascience.com/deploying-machine-learning-into-production-dont-do-labs-7dd35576da3f">towardsdatascience.com</a></p>
<p>Although I was aware that a large majority of data science projects don’t make it to production (87% claimed here), this article states that this is not due to the lack of value for the models but more to the difficulty to scale.</p>
<ul>
<li>Gap between data scientists and data engineers</li>
<li>Development in isolated environments away from user interaction, production constraints and businesses</li>
<li>ML adds additional metrics to monitor that can be either difficult to implement or to foresee (e.g. gender bias)</li>
</ul>
<p>I thought this article would describe how to industrialize the development environment to match production, but it actually makes the point that this is not enough. They put data scientists in the existing product teams, where they work as additional resources. This allows them to work hands-on with the people responsible for the end product, and the product owner to make enlightened decisions on where to put effort.</p>
<p>Some advantages of working in <em>labs</em> that are yet to be replicated in their new paradigm:</p>
<blockquote>
<p>System Thinking</p>
<ul>
<li>Allows getting out of the team routine and approaching each problem with a fresh perspective</li>
</ul>
</blockquote>
<blockquote>
<p>Ideal Design</p>
<ul>
<li>Don’t limit yourself to what seems possible instead of what would be ideal</li>
</ul>
</blockquote>
<blockquote>
<p>Always be Innovating</p>
<ul>
<li>Use cutting-edge tools and solutions without limitations on what will be possible. Explore and test without having to think about profitability and production efficiency.</li>
</ul>
</blockquote>
<h2 id="mlops-is-changing-how-machine-learning-models-are-developed">MLOps Is Changing How Machine Learning Models Are Developed</h2>
<p>Source: <a href="https://www.kdnuggets.com/2020/12/mlops-changing-machine-learning-developed.html">kdnuggets.com</a></p>
<p>MLOps has to be one of the most popular topics in the ML world today. A few key points are addressed here to show what moving from ML labs to proper production-ready ML implies.</p>
<blockquote>
<p>Version Control is Not Just for Code</p>
</blockquote>
<p>Data versioning is a big concern in ML. With ever larger data sets, typical versioning tools might not be applicable. In addition, GDPR-like concerns mean that data is usually more sensitive than the code base.</p>
<blockquote>
<p>Build Safeguards into the Code</p>
</blockquote>
<p>Checking the input data used for training, validation, etc. prevents big swings in your model trainings. Similarly, checking differences between the previous and new models allows early detection of issues (this could be done by checking prediction differences element by element).</p>
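<p>A sketch of that last safeguard (the names and the threshold are mine, purely illustrative): compare the two models’ predictions element by element and refuse to promote the new model if they diverge more than expected.</p>

```python
def prediction_drift(old_preds, new_preds):
    """Fraction of examples on which the two models disagree."""
    disagreements = sum(o != n for o, n in zip(old_preds, new_preds))
    return disagreements / len(old_preds)

def safe_to_promote(old_preds, new_preds, max_drift=0.1):
    """Safeguard: block promotion when predictions swing more than expected."""
    return prediction_drift(old_preds, new_preds) <= max_drift

old = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
new = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]   # one flipped prediction out of ten
assert safe_to_promote(old, new)                      # 10% drift, at the threshold
assert not safe_to_promote(old, new, max_drift=0.05)  # stricter budget rejects it
```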
<blockquote>
<p>The Pipeline is the Product – Not the Model</p>
</blockquote>
<blockquote>
<p>You give a poor man a fish and you feed him for a day. You teach him to fish and you give him an occupation that will feed him for a lifetime.”</p>
</blockquote>
<p>Only the whole pipeline allows for proper control over the models and in the end, positive value for the project.</p>
<p>I’d cross this with previous article on data scientists working in labs. Even a well designed and defined pipeline might not be production ready nor usable in real-world.</p>Baptiste MaingretReading session #12021-02-05T00:00:00+01:002021-02-05T00:00:00+01:00https://bmaingret.github.io/blog/reading-session-1<h2 id="articles">Articles</h2>
<ul>
<li><a href="#apache-superset">Apache Superset</a></li>
<li><a href="#create-a-devops-culture-with-open-source-principles">Create a DevOps culture with open source principles</a></li>
<li><a href="#google-recommits-to-the-python-ecosystem">Google recommits to the Python ecosystem</a></li>
<li><a href="#now-announcing-makefile-support-in-visual-studio-code">Now announcing: Makefile support in Visual Studio Code!</a></li>
<li><a href="#abracadabra-bringing-the-magics-to-xeus-python">Abracadabra! Bringing the magics to xeus-python</a></li>
</ul>
<!-- -->
<h2 id="apache-superset">Apache Superset</h2>
<p>Source: <a href="https://superset.apache.org/">superset.apache.org</a></p>
<blockquote>
<p>Apache Superset (Incubating) is a modern, enterprise-ready business intelligence web application.</p>
</blockquote>
<p><img src="/assets/2021-00-00-reading-sessions/superset2021-02-05%20180319.png" alt="Superset gallery" /></p>
<p>Some comments:</p>
<ul>
<li>“Enterprise-ready” might be a stretch judging from others’ experiences</li>
<li>It seems to me more appropriate for a single consolidated database (think data lake) than for multiple databases</li>
<li>Some advanced charts, but with simple/cleaned data. It really is a visualization tool; all preprocessing must be done beforehand.</li>
</ul>
<h2 id="create-a-devops-culture-with-open-source-principles">Create a DevOps culture with open source principles</h2>
<p>Source: <a href="https://opensource.com/article/20/12/remote-devops">opensource.com</a></p>
<p>I find this article to provide reasonable guidelines that are applicable to IT departments (and probably others) being remote or not.</p>
<p>Open source principles:</p>
<h3 id="community">Community</h3>
<blockquote>
<p>Being part of team goals can help people escape the stress of the home front.</p>
</blockquote>
<p>Nothing is worse than being home alone, facing an issue with no support.</p>
<h3 id="collaboration">Collaboration</h3>
<blockquote>
<p>Collaboration—during a pandemic or not—is about culture, not the latest tool or platform.</p>
</blockquote>
<p>Oftentimes people look for new tools to help them collaborate better, whereas it is the mindset that needs to change.</p>
<h3 id="transparency">Transparency</h3>
<blockquote>
<p>Remote DevOps teams benefit from centralizing access to project information and materials.</p>
</blockquote>
<p>Although I find that not every piece of information should be addressed directly to everyone, there is nothing worse than withholding information. I think everyone is able and should be allowed to comment and give one’s opinion.</p>
<h3 id="release-early-and-often">Release early and often</h3>
<blockquote>
<p>When a remote DevOps team releases early and often, they prove the remote work model’s validity and give stakeholders something real to see</p>
</blockquote>
<p>Although in some companies and industries it is not welcome to show up with unfinished projects, I find it rewarding and motivating for everyone. A good balance is necessary, since each release does bring additional work.</p>
<h3 id="pivot-and-refresh">Pivot and refresh</h3>
<blockquote>
<p>Just as you stop to correct software delivery issues, you need to start doing the same with communications and collaboration.</p>
</blockquote>
<p>When you find something did not happen as expected because of poor communication, don’t dwell on it; take it as an opportunity to make some changes.</p>
<h2 id="google-recommits-to-the-python-ecosystem">Google recommits to the Python ecosystem</h2>
<p>Source: <a href="https://sdtimes.com/softwaredev/google-recommits-to-the-python-ecosystem">sdtimes.com</a></p>
<p>Google cloud environment was my <a href="https://github.com/bmaingret/kaist-wst660-gae-app">first experience with Python</a> thanks to their Google App Engine.</p>
<p>It is always important for large companies to put a significant amount of support behind these technologies, considering how much they build on them, while being careful not to fall into situations similar to Oracle/Sun and the Oracle/Java ecosystem.</p>
<h2 id="now-announcing-makefile-support-in-visual-studio-code">Now announcing: Makefile support in Visual Studio Code!</h2>
<p>Source: <a href="https://devblogs.microsoft.com/cppblog/now-announcing-makefile-support-in-visual-studio-code">devblogs.microsoft.com</a></p>
<p>Although make and makefiles are more than 50 years old, and are not the new shiny tools, they can be of great help in data-science projects.</p>
<p>There are usually a lot of similar steps required to set up environments, run different stages (data ETL, model training, evaluation, etc.), manage cloud resources, and so on, all of which can be greatly sped up and made less error-prone thanks to makefiles.</p>
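<p>To make this concrete, here is a minimal makefile sketch for such a project. All target names, scripts and paths below are hypothetical:</p>

```make
# Illustrative makefile for a data-science project; every target, script
# and path name here is hypothetical. Recipe lines must start with a tab.
.PHONY: env data train clean

env:            ## create the virtual environment and install dependencies
	python -m venv .venv && .venv/bin/pip install -r requirements.txt

data:           ## run the data ETL step
	.venv/bin/python src/etl.py

train: data     ## train the model (re-runs the ETL step first if needed)
	.venv/bin/python src/train.py

clean:
	rm -rf .venv data/processed
```

<p>Running <code>make train</code> then chains the ETL and training steps in a single, repeatable command.</p>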
<p>Another article presenting some of it: <a href="https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c">medium.com/@davidstevens_16424</a>.</p>
<h2 id="abracadabra-bringing-the-magics-to-xeus-python">Abracadabra! Bringing the magics to xeus-python</h2>
<p>Source: <a href="https://blog.jupyter.org/abracadabra-bringing-the-magics-to-xeus-python-9d17bcfacb4">blog.jupyter.org</a></p>
<p>It is always interesting to read documentation on the behind-the-scenes of some of the tools we use. I find it greatly enhances the way I work with them, by getting a glimpse of how and why the tool was made as it is and where it is going.</p>
<p>This article presents the work on the <a href="https://blog.jupyter.org/a-new-python-kernel-for-jupyter-fcdf211e30a8">next Jupyter kernel</a>, partly based on Xeus, a C++ implementation of the Jupyter kernel protocol. And reading this, all of a sudden we can discover some of the components of what we usually call a Jupyter notebook…</p>Baptiste MaingretMotor Trend Car Road Tests (mtcars) datasets - Analysis and Regression2019-11-12T00:00:00+01:002019-11-12T00:00:00+01:00https://bmaingret.github.io/blog/Motor-Trend-Car-Road-Tests-Analysis-and-Regression<h2 id="motor-trend-car-road-tests-mtcars-datasets---analysis-and-regression">Motor Trend Car Road Tests (mtcars) datasets - Analysis and Regression</h2>
<p>This assignment was part of the Johns Hopkins Coursera module on
<a href="https://www.coursera.org/learn/regression-models">Regression Models</a> as
part of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science
Specialization</a>.</p>
<!--more-->
<p>Source code available on
<a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/07-regression-models/01-project">GitHub</a></p>
<h2 id="summary">Summary</h2>
<p>We want to answer these two questions:</p>
<ul>
<li>Is an automatic or manual transmission better for MPG?</li>
<li>Quantify the MPG difference between automatic and manual
transmissions.</li>
</ul>
<p>We compared the mean mpg for automatic and manual transmissions and
concluded that the difference in favor of manual transmission in terms
of mpg was significant. We then looked further at other variables to
explain the difference in mpg.</p>
<h2 id="look-at-the-data">Look at the data</h2>
<p>Glimpse at the data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## mpg cyl disp hp drat wt qsec vs am gear carb mean.mpg
## 1 21.0 6 160 110 3.90 2.620 16.46 v.shaped manual 4 4 20.09062
## 2 21.0 6 160 110 3.90 2.875 17.02 v.shaped manual 4 4 20.09062
## 3 22.8 4 108 93 3.85 2.320 18.61 straight manual 4 1 20.09062
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 v.shaped:18
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 straight:14
## Median :3.695 Median :3.325 Median :17.71
## Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :4.930 Max. :5.424 Max. :22.90
## am gear carb
## automatic:19 Min. :3.000 Min. :1.000
## manual :13 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
</code></pre></div></div>
<h2 id="mpg-difference-between-automatic-and-manual-transmission">MPG difference between automatic and manual transmission</h2>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-4-1.png" alt="" /><!-- --></p>
<p>Looking at the boxplot, we see a difference between the mpg of the
two transmission types.</p>
<p>We check normality, variance equality to see how we can conduct our test
(details in appendix), and then conducted a two-sided T-Test:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg.test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We have a p-value of 0.14% < 5%, and a confidence interval of [-11;
-3.2] (excluding 0) for the difference in mean mpg between automatic
and manual transmissions.</p>
<p>From the look of this, manual transmission allows for more mpg, with
roughly 7 more mpg on average.</p>
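<p>As a side note, the statistic computed by R’s <code>t.test(..., var.equal = FALSE)</code> (Welch’s t-test) can be sketched in plain Python. The samples below are made-up numbers for illustration, not the actual mtcars values:</p>

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    i.e. what R's t.test(..., var.equal = FALSE) computes internally."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# made-up samples with equal spread but shifted means
auto = [10, 12, 14, 16, 18]
manual = [20, 22, 24, 26, 28]
t, df = welch_t(auto, manual)  # t = -5.0, df = 8.0
```

<p>The p-value then comes from the t distribution with <code>df</code> degrees of freedom, which R handles for us.</p>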
<p>If we fit a simple linear model to our data we end up with similar
results as previously (an increase of roughly 7.2 mpg), and we can have
a look at the residual plot, which is almost normal (graphically
speaking) for automatic but not as much for manual. Looking at the
residuals against several other possible predictors, we can see some
linear trends (e.g. hp and wt).</p>
<h2 id="going-further">Going further</h2>
<p>Looking at the pair plot and correlation plot, we see that other
variables seem more correlated with mpg than am.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggpairs</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">am</span><span class="p">),</span><span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">11</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">progress</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">continuous</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrap</span><span class="p">(</span><span class="s2">"cor"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)))</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-6-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mtcars.cor</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">mtcars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">am</span><span class="o">=</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">am</span><span class="p">),</span><span class="w"> </span><span class="n">vs</span><span class="o">=</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">vs</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">mean.mpg</span><span class="p">)))</span><span class="w">
</span><span class="n">corrplot</span><span class="p">(</span><span class="n">mtcars.cor</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">,</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hclust"</span><span class="p">,</span><span class="w"> </span><span class="n">tl.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">tl.srt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-6-2.png" alt="" /><!-- --></p>
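<p>Each cell of the corrplot is simply a Pearson correlation coefficient, which is easy to compute by hand; a small Python sketch on made-up numbers:</p>

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# a perfectly linear relationship gives r = 1, a decreasing one r = -1
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # 1.0
r_neg = pearson([1, 2, 3], [3, 2, 1])        # -1.0
```

<p>Categorical columns such as am and vs first have to be coerced to numbers, which is what the <code>mutate(am=as.numeric(am), ...)</code> call above does.</p>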
<h3 id="adding-variables-to-our-model">Adding variables to our model</h3>
<p>We can try to add wt, cyl and disp, which seem to be relevant
candidates both from a mechanical point of view and from the corrplot.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rownames</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">fit2</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="o">~</span><span class="n">I</span><span class="p">(</span><span class="n">hp</span><span class="o">/</span><span class="m">10</span><span class="p">)</span><span class="o">+</span><span class="n">wt</span><span class="o">+</span><span class="n">cyl</span><span class="o">+</span><span class="n">disp</span><span class="o">+</span><span class="n">am</span><span class="p">,</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ I(hp/10) + wt + cyl + disp + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## I(hp/10) -0.27960 0.13922 -2.008 0.05510 .
## wt -3.30262 1.13364 -2.913 0.00726 **
## cyl -1.10638 0.67636 -1.636 0.11393
## disp 0.01226 0.01171 1.047 0.30472
## ammanual 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
</code></pre></div></div>
<p>Only weight and, marginally, horsepower seem significant; transmission type does not.</p>
<h3 id="modelling-withough-transmission-type">Modelling without transmission type</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit3</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="o">~</span><span class="n">I</span><span class="p">(</span><span class="n">hp</span><span class="o">/</span><span class="m">10</span><span class="p">)</span><span class="o">+</span><span class="n">wt</span><span class="o">+</span><span class="n">cyl</span><span class="o">+</span><span class="n">disp</span><span class="p">,</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ I(hp/10) + wt + cyl + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## I(hp/10) -0.20538 0.12147 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">anova</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="n">fit3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Analysis of Variance Table
##
## Model 1: mpg ~ I(hp/10) + wt + cyl + disp + am
## Model 2: mpg ~ I(hp/10) + wt + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 163.12
## 2 27 170.44 -1 -7.3245 1.1675 0.2898
</code></pre></div></div>
<p>We see we have similar R-squared, RSS and p-value while dropping the
transmission type.</p>
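<p>The F statistic in this table can be recomputed by hand from the residual sums of squares of the two nested models; a small Python sketch plugging in the values from the anova output above:</p>

```python
def nested_f(rss_full, df_full, rss_reduced, df_reduced):
    """F statistic for comparing two nested linear models,
    as computed by R's anova() on a pair of fits."""
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

# residual sums of squares and degrees of freedom from the table above
f = nested_f(rss_full=163.12, df_full=26, rss_reduced=170.44, df_reduced=27)
# f is about 1.167, matching the reported 1.1675 (the inputs are rounded)
```

<p>Since the F value is small (p ≈ 0.29), dropping <code>am</code> does not significantly worsen the fit.</p>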
<h3 id="automatic-model-selection">Automatic model selection</h3>
<p>Let’s try some automatic model selection to see what we could get.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fit the full model </span><span class="w">
</span><span class="n">full.model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="c1"># Stepwise regression model</span><span class="w">
</span><span class="n">step.model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stepAIC</span><span class="p">(</span><span class="n">full.model</span><span class="p">,</span><span class="w"> </span><span class="n">direction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"both"</span><span class="p">,</span><span class="w">
</span><span class="n">trace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">step.model</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = datasets::mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
</code></pre></div></div>
<p>We find again wt and am, which comforts our previous models. We also
have an additional variable that we did not explore before: qsec.</p>
<p>We can however argue that qsec is strongly correlated with horsepower
(and cylinders, displacement, etc.).</p>
<h3 id="some-pca">Some PCA</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"FactoMineR"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"factoextra"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res.pca</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">PCA</span><span class="p">(</span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="n">scale.unit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ncp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">graph</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">fviz_pca_var</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">col.var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cos2"</span><span class="p">,</span><span class="w"> </span><span class="n">repel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_eig</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">addlabels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-2.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_contrib</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">choice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">top</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-3.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_contrib</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">choice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">top</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-4.png" alt="" /><!-- --></p>
<h2 id="normality-and-variance">Normality and variance</h2>
<h3 id="normality-of-data">Normality of data</h3>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-11-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shapiro.test</span><span class="p">(</span><span class="n">manual</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Shapiro-Wilk normality test
##
## data: manual
## W = 0.9458, p-value = 0.5363
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shapiro.test</span><span class="p">(</span><span class="n">auto</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Shapiro-Wilk normality test
##
## data: auto
## W = 0.97677, p-value = 0.8987
</code></pre></div></div>
<h3 id="comparison-of-variance">Comparison of variance</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">var.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## F test to compare two variances
##
## data: auto and manual
## F = 0.38656, num df = 18, denom df = 12, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1243721 1.0703429
## sample estimates:
## ratio of variances
## 0.3865615
</code></pre></div></div>
<h3 id="t-test">T-Test</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg.test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">mpg.test</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Welch Two Sample t-test
##
## data: auto and manual
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
</code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-15-1.png" alt="" /><!-- --></p>
<h3 id="residual-plots">Residual plots</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">qplot</span><span class="p">(</span><span class="n">residuals</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">mtcars</span><span class="o">$</span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'density'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-16-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mtcars</span><span class="o">$</span><span class="n">mpg.resid</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">residuals</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span><span class="w">
</span><span class="n">mtcars.gathered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mtcars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="p">(</span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">mpg.resid</span><span class="p">,</span><span class="w"> </span><span class="n">cyl</span><span class="p">,</span><span class="w"> </span><span class="n">disp</span><span class="p">,</span><span class="w"> </span><span class="n">hp</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="p">,</span><span class="w"> </span><span class="n">qsec</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate_if</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">gather</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">am</span><span class="p">,</span><span class="n">mpg.resid</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">mtcars.gathered</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mpg.resid</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">am</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">key</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-16-2.png" alt="" /><!-- --></p>Baptiste MaingretMotor Trend Car Road Tests (mtcars) datasets - Analysis and Regression This assignment was part of the Johns Hopkins Coursera module on Regression Models as part of the Data Science Specialization.Effect of Vitamin C on Tooth Growth in Guinea Pigs2019-10-30T00:00:00+01:002019-10-30T00:00:00+01:00https://bmaingret.github.io/blog/Effect-of-Vitamin-C-on-Tooth-Growth-in-Guinea-Pigs<h2 id="basic-inferential-data-analysis-on-toothgrowth-dataset-part-of-statistical-inference-by-johns-hopkins-university">Basic Inferential Data Analysis on ToothGrowth dataset (part of Statistical Inference by Johns Hopkins University)</h2>
<p>This assignment was part of the Johns Hopkins Coursera module on
<a href="https://www.coursera.org/learn/statistical-inference">Statistical
Inference</a> as part
of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science
Specialization</a>.</p>
<!--more-->
<p>Source code available on
<a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/06-statistical-inference/01-project">GitHub</a></p>
<h2 id="overview">Overview</h2>
<p>The goal is to conduct some simple hypothesis testing on the ToothGrowth
dataset available in the R datasets package.</p>
<p>Some assumptions:</p>
<ul>
<li>equal variances among groups</li>
<li>standard deviation estimated from the samples</li>
<li><img src="https://render.githubusercontent.com/render/math?math=\alpha" /> is set to 5%</li>
<li>samples are not paired</li>
</ul>
<h2 id="data-processing">Data processing</h2>
<p>We import the data and directly set the <em>dose</em> as a factor.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">datasets</span><span class="p">)</span><span class="w">
</span><span class="n">tg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">datasets</span><span class="o">::</span><span class="n">ToothGrowth</span><span class="w">
</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>A glimpse at the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="p">(</span><span class="n">tg</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary</span><span class="p">(</span><span class="n">tg</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
</code></pre></div></div>
<p>Density plots of tooth length, faceted by dose and delivery method:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="n">qplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">len</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">tg</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">dose</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dose</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">facets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dose</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">supp</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Effect-of-Vitamin-C-on-Tooth-Growth-in-Guinea-Pigs/figure-gfm/unnamed-chunk-3-1.png" alt="" /><!-- --></p>
<h3 id="has-the-delivery-method-an-impact-on-tooth-growth">Does the delivery method have an impact on tooth growth?</h3>
<p>We will test against the null hypothesis that there is no
difference in means between the two groups.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">supp</span><span class="o">==</span><span class="s2">"OJ"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">supp</span><span class="o">==</span><span class="s2">"VC"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The p-value (6.0393371%) is larger than 5%, and in addition the
95% confidence interval (-0.1670064, 7.5670064) contains the value 0.
We therefore fail to reject the null hypothesis in this case.</p>
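<p>The figures quoted above can be read directly off the test object; for instance, reusing the <code class="language-plaintext highlighter-rouge">t.res</code> object computed in the previous chunk:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t.res$p.value   # p-value of the two-sided test, about 0.0604
t.res$conf.int  # 95% confidence interval for the difference in means
</code></pre></div></div>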
<h3 id="has-the-dose-an-impact-on-tooth-growth">Does the dose have an impact on tooth growth?</h3>
<p>We test the difference in means between each pair of dosages (3 tests: 0.5 vs 1, 0.5 vs 2, 1 vs 2).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"0.5"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"0.5"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## dose.0.5v1 dose0.5v2 dose.1v2
## p-value 1.266297e-07 2.837553e-14 1.810829e-05
## conf-interval-low -1.198375e+01 -1.815352e+01 -8.994387e+00
## conf-interval-up -6.276252e+00 -1.283648e+01 -3.735613e+00
## power 9.909607e-01 1.000000e+00 9.057799e-01
</code></pre></div></div>
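<p>The summary table above can be assembled from the three test and power objects. A minimal sketch, reusing the <code class="language-plaintext highlighter-rouge">t.res.a</code>/<code class="language-plaintext highlighter-rouge">b</code>/<code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">p.res.a</code>/<code class="language-plaintext highlighter-rouge">b</code>/<code class="language-plaintext highlighter-rouge">c</code> objects from the chunks above (the column labels are illustrative):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tests  <- list(t.res.a, t.res.b, t.res.c)
powers <- list(p.res.a, p.res.b, p.res.c)
res <- rbind(
  "p-value"           = sapply(tests, function(t) t$p.value),
  "conf-interval-low" = sapply(tests, function(t) t$conf.int[1]),
  "conf-interval-up"  = sapply(tests, function(t) t$conf.int[2]),
  "power"             = sapply(powers, function(p) p$power)
)
colnames(res) <- c("dose.0.5v1", "dose.0.5v2", "dose.1v2")
res
</code></pre></div></div>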
<h2 id="conclusions">Conclusions</h2>
<p>We failed to reject the null-hypothesis regarding the impact of the
delivery method on tooth growth.</p>
<p>The dosage, however, was found to be statistically significant:
all three pairwise tests rejected the null hypothesis.</p>
<p>This assignment was part of the Johns Hopkins Coursera module on <a href="https://www.coursera.org/learn/reproducible-research">Reproducible Research</a> as part of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science Specialization</a>.</p>
<p><!--more--></p>
<p>Full code can be found on <a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/05-reproducible-research/01-week2-assignement">GitHub</a>.</p>
<h2 id="loading-and-preprocessing-the-data">Loading and preprocessing the data</h2>
<p>The variables included in this dataset are:</p>
<ul>
<li>
<p><strong>steps</strong>: Number of steps taken in a 5-minute interval (missing
values are coded as <code class="language-plaintext highlighter-rouge">NA</code>)</p>
</li>
<li>
<p><strong>date</strong>: The date on which the measurement was taken in YYYY-MM-DD
format</p>
</li>
<li>
<p><strong>interval</strong>: Identifier for the 5-minute interval in which
measurement was taken</p>
</li>
</ul>
<p>The dataset is stored in a comma-separated-value (CSV) file and
contains a total of 17,568 observations.</p>
<p>Loading the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"activity.csv"</span><span class="p">)){</span><span class="w">
</span><span class="n">unzip</span><span class="p">(</span><span class="s2">"activity.zip"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"activity.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">na.strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"NA"</span><span class="p">,</span><span class="w"> </span><span class="n">colClasses</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"integer"</span><span class="p">,</span><span class="w"> </span><span class="s2">"character"</span><span class="p">,</span><span class="w"> </span><span class="s2">"integer"</span><span class="p">))</span><span class="w">
</span><span class="n">data</span><span class="o">$</span><span class="n">date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="o">=</span><span class="s2">"%Y-%m-%d"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Checking the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
</code></pre></div></div>
<h2 id="what-is-mean-total-number-of-steps-taken-per-day">What is the mean total number of steps taken per day?</h2>
<p><em>For this part of the assignment, you can ignore the missing values in the dataset.</em></p>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_steps</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">total_steps</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">total</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_histogram</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Number of steps per day"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
</code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-4-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">total_steps</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 9354.23
</code></pre></div></div>
<p>The mean total number of steps per day is: <strong>9354.23</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">median</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">total_steps</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">median</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 10395
</code></pre></div></div>
<p>The median total number of steps per day is: <strong>10395</strong></p>
<h2 id="what-is-the-average-daily-activity-pattern">What is the average daily activity pattern?</h2>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">steps_interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">steps_interval</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Average steps per interval"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"Average steps"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">"Interval"</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-8-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">steps_interval</span><span class="o">$</span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">steps_interval</span><span class="p">[[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">]]</span><span class="w">
</span><span class="n">val</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">steps_interval</span><span class="p">[[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="s2">"mean"</span><span class="p">]]</span><span class="w">
</span></code></pre></div></div>
<p>The interval with the highest mean number of steps is <strong>835</strong>, with a mean of <strong>206.17</strong> steps.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">floor</span><span class="p">(</span><span class="n">interval</span><span class="o">/</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">interval</span><span class="o">%%</span><span class="m">100</span><span class="w">
</span></code></pre></div></div>
<p>The interval identifier encodes the clock time as hour × 100 + minute (the values run 0, 5, …, 55, 100, 105, …, 2355), so interval <strong>835</strong> corresponds to <strong>08:35</strong>.</p>
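<p>This conversion can be wrapped in a small helper for reuse (a sketch; <code class="language-plaintext highlighter-rouge">interval_to_time</code> is a hypothetical name, not part of the original analysis):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Format an HHMM-encoded interval identifier as a clock-time string
interval_to_time <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}
interval_to_time(835)  # "08:35"
</code></pre></div></div>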
<h2 id="imputing-missing-values">Imputing missing values</h2>
<p>Total number of missing values per column:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">apply</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## 2304 0 0
</code></pre></div></div>
<p>We will fill in the missing <code class="language-plaintext highlighter-rouge">steps</code> values with the mean for the matching day of the week and interval.</p>
<p>First we compute the mean for each interval and day of the week:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">weekday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">wday</span><span class="p">))</span><span class="w">
</span><span class="n">fill_val</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">weekday</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p>Imputing missing data.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">row</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">data_nna</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"steps"</span><span class="p">]))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">wd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"weekday"</span><span class="p">]</span><span class="w">
</span><span class="n">interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">]</span><span class="w">
</span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"steps"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fill_val</span><span class="p">[</span><span class="n">fill_val</span><span class="o">$</span><span class="n">weekday</span><span class="o">==</span><span class="n">wd</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">fill_val</span><span class="o">$</span><span class="n">interval</span><span class="o">==</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="s2">"mean"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data_nna</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval weekday
## 0 0 0 0
</code></pre></div></div>
<p>Repeating the first steps of the assignment, now with the imputed data.
Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_steps_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">steps</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">total</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_histogram</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Number of steps per day"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
</code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-15-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="o">$</span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="n">mean_nna</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 10821.21
</code></pre></div></div>
<p>The mean total number of steps per day is: <strong>10821.21</strong> (was 9354.23 before imputation).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">median_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="o">$</span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="n">median_nna</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 11015
</code></pre></div></div>
<p>The median total number of steps per day is: <strong>11015</strong> (was 10395 before imputation).</p>
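<p>For a quick side-by-side view of how imputation shifted the summary statistics, the values can be collected into one small table (a sketch; <code class="language-plaintext highlighter-rouge">summarise_imputation</code> is a hypothetical helper name):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Collect mean and median of daily totals before and after imputation
summarise_imputation <- function(before, after) {
  data.frame(
    statistic = c("mean", "median"),
    original  = c(mean(before), median(before)),
    imputed   = c(mean(after), median(after))
  )
}
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">summarise_imputation(total_steps$total, total_steps_nna$total)</code> should reproduce the figures reported above in a single data frame.</p>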
<h2 id="are-there-differences-in-activity-patterns-between-weekdays-and-weekends">Are there differences in activity patterns between weekdays and weekends?</h2>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># POSIXlt $wday runs 0-6 starting on Sunday, so the weekend is c(0, 6)</span><span class="w">
</span><span class="n">steps_interval_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">week.part</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">if_else</span><span class="p">(</span><span class="n">weekday</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="s2">"weekend"</span><span class="p">,</span><span class="w"> </span><span class="s2">"weekdays"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">week.part</span><span class="p">,</span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">steps_interval_nna</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="n">week.part</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">rows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vars</span><span class="p">(</span><span class="n">week.part</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Average steps per interval: weekdays vs weekends"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"Average steps"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">"Interval"</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-19-1.png" alt="" /><!-- --></p>
<p>This assignment was part of the Johns Hopkins Coursera module on Reproducible Research, itself part of the Data Science Specialization.</p>