Using 1password, GPG and git for seamless commits signing (Baptiste Maingret, 2022-02-15)<p>Setting up gpg, git and 1Password to have your git commits signed, while storing your GPG key passphrase in 1Password and unlocking it directly from your terminal.</p>
<!--more-->
<ul>
<li><a href="#requirements-and-setup">Requirements and setup</a></li>
<li><a href="#creating-a-gpg-key">Creating a GPG key</a></li>
<li><a href="#setting-up-git">Setting up git</a></li>
<li><a href="#setting-up-github">Setting up GitHub</a></li>
<li><a href="#setting-up-1password-cli">Setting up 1Password CLI</a></li>
<li><a href="#configuring-gpg-agent">Configuring gpg-agent</a></li>
<li><a href="#putting-it-all-together">Putting it all together</a>
<ul>
<li><a href="#finding-your-1password-entry">Finding your 1Password entry</a></li>
<li><a href="#getting-your-gpg-key-grip">Getting your GPG key grip</a></li>
<li><a href="#binding-the-two">Binding the two</a></li>
<li><a href="#running-at-login">Running at login</a></li>
</ul>
</li>
<li><a href="#testing-it">Testing it</a></li>
<li><a href="#sources">Sources</a></li>
</ul>
<h2 id="requirements-and-setup">Requirements and setup</h2>
<p>This was done on WSL2.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ uname -a
</span><span class="gp">Linux DESKTOP-AGPN69M 5.10.60.1-microsoft-standard-WSL2 #</span>1 SMP Wed Aug 25 23:20:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
<span class="go">❯ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
</span></code></pre></div></div>
<p>Make sure you have git and gpg installed, and an active 1Password account.</p>
<p>You’ll often find advice to use gpg2 when it is available on your system, but in my case both gpg and gpg2 pointed to the same binary.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ ls -l "$</span><span class="o">(</span>which gpg<span class="o">)</span><span class="s2">"
</span><span class="go">.rwxr-xr-x 1.1M root 6 Jan 2021 /usr/bin/gpg
</span><span class="gp">❯ ls -l "$</span><span class="o">(</span>which gpg2<span class="o">)</span><span class="s2">"
</span><span class="gp">lrwxrwxrwx 3 root 6 Jan 2021 /usr/bin/gpg2 -></span><span class="w"> </span><span class="s2">gpg
</span><span class="go">❯ gpg --version
gpg (GnuPG) 2.2.19
</span></code></pre></div></div>
<p>I will use <a href="https://stedolan.github.io/jq/">jq</a> but this is not essential.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ jq --version
jq-1.6
</span></code></pre></div></div>
<h2 id="creating-a-gpg-key">Creating a GPG key</h2>
<p>This is a summary of <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key">GitHub - Generating a new GPG key</a>, that I followed.</p>
<p>Run the following command and follow instructions.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --full-generate-key
</span></code></pre></div></div>
<p>Then list your fresh key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --list-secret-keys --keyid-format=long
/home/baptiste/.gnupg/pubring.kbx
---------------------------------
sec rsa4096/0052A8D354A5C655 2022-02-09 [SC]
9BA03414AB56590B6DB5369F0052A8D354A5C655
</span><span class="gp">uid [ultimate] Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">ssb rsa4096/A5B8C64E8929B475 2022-02-09 [E]
</span></code></pre></div></div>
<p>Look at the <code class="language-plaintext highlighter-rouge">sec</code> line and note the GPG key ID: <code class="language-plaintext highlighter-rouge">0052A8D354A5C655</code>.</p>
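<p>If you want to grab this ID in a script rather than by eye, a small helper can pull it out of the listing. This is only a sketch: the <code class="language-plaintext highlighter-rouge">key_id_from_listing</code> name is my own, and it assumes the <code class="language-plaintext highlighter-rouge">--keyid-format=long</code> output shown above.</p>

```shell
# Pull the long key ID out of gpg's listing: it is the part after
# the slash on the "sec" line. (Helper name is mine, not a gpg command.)
key_id_from_listing() {
  awk '/^sec/ { split($2, parts, "/"); print parts[2]; exit }'
}

# In practice: gpg --list-secret-keys --keyid-format=long | key_id_from_listing
```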
<p>Then we export the corresponding public key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --armor --export 0052A8D354A5C655
-----BEGIN PGP PUBLIC KEY BLOCK-----
</span><span class="gp">#</span><span class="w"> </span>your public key
<span class="go">-----END PGP PUBLIC KEY BLOCK-----
</span></code></pre></div></div>
<p>Copy everything including the starting and ending blocks.</p>
<h2 id="setting-up-git">Setting up git</h2>
<p>First let’s tell <code class="language-plaintext highlighter-rouge">git</code> which key to use. Using your GPG key ID, run:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git config --global user.signingkey 0052A8D354A5C655
</span></code></pre></div></div>
<p><strong>N.B.</strong> This will configure it globally; you may need to configure it per repository depending on your usage.</p>
<p>Then we will tell <code class="language-plaintext highlighter-rouge">git</code> to sign <strong>every commit of every repository</strong>.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git config --global commit.gpgsign true
</span></code></pre></div></div>
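<p>You can double-check what git will actually use before your next commit. Note that if <code class="language-plaintext highlighter-rouge">user.signingkey</code> is unset, gpg falls back to looking up a key matching your committer identity.</p>

```shell
# Show the effective signing configuration, with a notice when a value is unset.
git config --global --get user.signingkey || echo "user.signingkey not set"
git config --global --get commit.gpgsign  || echo "commit.gpgsign not set"
```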
<h2 id="setting-up-github">Setting up GitHub</h2>
<p>Instructions may change. Check online documentation <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/adding-a-new-gpg-key-to-your-github-account">Adding a new GPG key to your GitHub account</a>.</p>
<p>TL;DR. <code class="language-plaintext highlighter-rouge">Settings > Access > New GPG key</code></p>
<h2 id="setting-up-1password-cli">Setting up 1Password CLI</h2>
<p>Instructions may change. Check online documentation. <a href="https://support.1password.com/command-line-getting-started/">1Password CLI: Getting started</a></p>
<p><strong>N.B.</strong> The version is hardcoded in the URL, so check the official website for the latest one.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ curl -S https://cache.agilebits.com/dist/1P/op/pkg/v1.12.4/op_linux_amd64_v1.12.4.zip ></span><span class="w"> </span>op.zip
<span class="go"> % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3810k 100 3810k 0 0 4897k 0 --:--:-- --:--:-- --:--:-- 4891k
❯ unzip op.zip -d op
Archive: op.zip
extracting: op/op.sig
inflating: op/op
❯ gpg --keyserver hkps://keyserver.ubuntu.com --receive-keys 3FEF9748469ADBE15DA7CA80AC2D62742012EA22
</span><span class="gp">gpg: key AC2D62742012EA22: public key "Code signing for 1Password <codesign@1password.com></span><span class="s2">" imported
</span><span class="go">gpg: Total number processed: 1
gpg: imported: 1
❯ gpg --verify op/op.sig op/op
gpg: Signature made Fri Jan 14 22:38:08 2022 CET
gpg: using RSA key 3FEF9748469ADBE15DA7CA80AC2D62742012EA22
</span><span class="gp">gpg: Good signature from "Code signing for 1Password <codesign@1password.com></span><span class="s2">" [unknown]
</span><span class="go">gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 3FEF 9748 469A DBE1 5DA7 CA80 AC2D 6274 2012 EA22
❯ sudo mv op /usr/bin
❯ op --version
1.12.4
</span></code></pre></div></div>
<p>Try to sign in.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">op signin my.1password.com your.email@example.com
</span></code></pre></div></div>
<h2 id="configuring-gpg-agent">Configuring gpg-agent</h2>
<p>We will make use of <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code> to cache our passphrase for our key. For that we need to make sure <code class="language-plaintext highlighter-rouge">gpg-agent</code> allows it.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">❯ echo "allow-preset-passphrase" ></span><span class="o">></span> ~/.gnupg/gpg-agent.conf
</code></pre></div></div>
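<p>Note that re-running the command above appends a duplicate line each time. If your setup script may run more than once, a small guard keeps the file idempotent (the <code class="language-plaintext highlighter-rouge">ensure_line</code> helper is my own naming, not a gpg tool). After editing the file, reload the agent with <code class="language-plaintext highlighter-rouge">gpg-connect-agent reloadagent /bye</code> so the option takes effect.</p>

```shell
# Append a line to a config file only if it is not already present,
# so repeated setup runs do not pile up duplicates.
ensure_line() {
  line=$1 file=$2
  mkdir -p "$(dirname "$file")"
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# In practice: ensure_line "allow-preset-passphrase" ~/.gnupg/gpg-agent.conf
```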
<h2 id="putting-it-all-together">Putting it all together</h2>
<h3 id="finding-your-1password-entry">Finding your 1Password entry</h3>
<p>I will assume you have a <code class="language-plaintext highlighter-rouge">1Password</code> entry storing your GPG key passphrase, with the name <code class="language-plaintext highlighter-rouge">GPG passphrase</code>.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ op get item "GPG passphrase" | jq ".uuid"
"vmgevmdnbbuui3evhksdftjhju"
</span></code></pre></div></div>
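<p>The UUID is printed with its surrounding quotes. When capturing it in a shell variable, <code class="language-plaintext highlighter-rouge">jq -r</code> gives you the raw string instead. A quick sketch, with sample JSON standing in for the real <code class="language-plaintext highlighter-rouge">op get item</code> output:</p>

```shell
# -r makes jq emit the raw string without quotes, handy for variables.
sample='{"uuid":"vmgevmdnbbuui3evhksdftjhju"}'
uuid=$(printf '%s' "$sample" | jq -r '.uuid')
echo "$uuid"
```

<p>In practice you would pipe <code class="language-plaintext highlighter-rouge">op get item "GPG passphrase"</code> into the same <code class="language-plaintext highlighter-rouge">jq -r '.uuid'</code> call.</p>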
<h3 id="getting-your-gpg-key-grip">Getting your GPG key grip</h3>
<p>We list our keys and their key grips.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ gpg --list-secret-keys --with-keygrip
/home/baptiste/.gnupg/pubring.kbx
---------------------------------
sec rsa4096 2022-02-09 [SC]
9BA03414AB56590B6DB5369F0052A8D354A5C655
Keygrip = 80160C5055DA07978E939C0575A4E8DA0B1ECF27
</span><span class="gp">uid [ultimate] Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">ssb rsa4096 2022-02-09 [E]
Keygrip = C04ACB8C33AAA68943194D7D1A56954BF76B5C2C
</span></code></pre></div></div>
<p>Look at the <code class="language-plaintext highlighter-rouge">sec</code> block and at the <code class="language-plaintext highlighter-rouge">Keygrip</code> entry: <code class="language-plaintext highlighter-rouge">80160C5055DA07978E939C0575A4E8DA0B1ECF27</code>.</p>
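<p>As with the key ID, you can extract the primary key’s grip in a script. A sketch (the <code class="language-plaintext highlighter-rouge">sec_keygrip</code> helper is my own naming), assuming the <code class="language-plaintext highlighter-rouge">--with-keygrip</code> layout shown above:</p>

```shell
# Print the keygrip of the primary key: the first "Keygrip =" line
# that follows the "sec" line.
sec_keygrip() {
  awk '/^sec/ { insec = 1 }
       insec && /Keygrip/ { print $3; exit }'
}

# In practice: gpg --list-secret-keys --with-keygrip | sec_keygrip
```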
<h3 id="binding-the-two">Binding the two</h3>
<p>We ask 1Password to retrieve the passphrase and pipe it directly to <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code>, specifying our key grip. Note that <code class="language-plaintext highlighter-rouge">gpg-preset-passphrase</code> reads the passphrase from <code class="language-plaintext highlighter-rouge">stdin</code> by default.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">op get item vmgevmdnbbuui3evhksdftjhju --fields password | /usr/lib/gnupg2/gpg-preset-passphrase --preset 80160C5055DA07978E939C0575A4E8DA0B1ECF27
</span></code></pre></div></div>
<p>If you weren’t already logged in to 1Password, you will be asked to enter your master password.</p>
<h3 id="running-at-login">Running at login</h3>
<p>I am using <code class="language-plaintext highlighter-rouge">zsh</code> as a shell, so I will add the following to my <code class="language-plaintext highlighter-rouge">~/.zshrc</code>, but you should be able to do the same in <code class="language-plaintext highlighter-rouge">~/.bashrc</code> for instance. Note that if you are using <a href="https://github.com/romkatv/powerlevel10k">powerlevel10k</a>, you will need to put it before the <code class="language-plaintext highlighter-rouge">instant-prompt</code> configuration.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>gpg_cache <span class="o">()</span> <span class="o">{</span>
gpg-connect-agent /bye &> /dev/null <span class="c"># 1</span>
<span class="nb">eval</span> <span class="si">$(</span>op signin my<span class="si">)</span> <span class="c"># 2</span>
op get item vmgevmdnbbuui3evhksdftjhju <span class="nt">--fields</span> password | /usr/lib/gnupg2/gpg-preset-passphrase <span class="nt">--preset</span> 80160C5055DA07978E939C0575A4E8DA0B1ECF27 <span class="c"># 3</span>
<span class="o">}</span>
gpg_cache <span class="c"># 4</span>
</code></pre></div></div>
<ol>
<li><code class="language-plaintext highlighter-rouge">gpg-agent</code> is started automatically when required. However, since <code class="language-plaintext highlighter-rouge">gpg</code> itself is not invoked here while we still need <code class="language-plaintext highlighter-rouge">gpg-agent</code> running, we have to make sure it is started. This is the best way I found to achieve it.</li>
<li>Log in to <code class="language-plaintext highlighter-rouge">1Password</code>.</li>
<li>Use our one-liner to retrieve the passphrase and cache it.</li>
<li>Call our beautiful function.</li>
</ol>
<p><strong>N.B.</strong> This will require you to log in each time you start a session. You could also simply remove the call to <code class="language-plaintext highlighter-rouge">gpg_cache</code> and run it from your terminal when needed.</p>
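<p>If the repeated logins bother you, one option is to sign in only when the cached session is no longer valid. This is a sketch assuming 1Password CLI v1, where <code class="language-plaintext highlighter-rouge">op list vaults</code> exits non-zero once the session token has expired, and the <code class="language-plaintext highlighter-rouge">my</code> account shorthand used above:</p>

```shell
# Prompt for the 1Password master password only when the cached
# session is invalid. (ensure_op_session is my own helper name.)
ensure_op_session() {
  if ! op list vaults > /dev/null 2>&1; then
    eval "$(op signin my)"
  fi
}
```

<p>You could then call <code class="language-plaintext highlighter-rouge">ensure_op_session</code> in <code class="language-plaintext highlighter-rouge">gpg_cache</code> in place of the unconditional <code class="language-plaintext highlighter-rouge">eval $(op signin my)</code>.</p>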
<h2 id="testing-it">Testing it</h2>
<p>Go into one of your git repositories; let’s create a branch and try this out.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git checkout -b signing
Switched to a new branch 'signing'
❯ touch dirty
❯ git add dirty
❯ git commit -m "Trying signing"
[signing 1426360] Trying signing
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 chapter-2/dirty
❯ git log --show-signature -1
</span><span class="gp">commit 1426360d301b88036feef02e00044e6ca62a9fd3 (HEAD -></span><span class="w"> </span>signing<span class="o">)</span>
<span class="go">gpg: Signature made Tue Feb 15 21:46:51 2022 CET
gpg: using RSA key 9BA03414AB56590B6DB5369F0052A8D354A5C655
</span><span class="gp">gpg: Good signature from "Baptiste Maingret (Home Desktop-WSL2) <baptiste.maingret@gmail.com></span><span class="s2">" [ultimate]
</span><span class="gp">Author: Baptiste Maingret <baptiste.maingret@gmail.com></span><span class="w">
</span><span class="go">Date: Tue Feb 15 21:46:51 2022 +0100
Trying signing
</span></code></pre></div></div>
<p>Remove our dirty work.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ git reset HEAD^
❯ rm dirty
❯ git checkout main
Switched to branch 'main'
❯ git branch -D signing
</span></code></pre></div></div>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://support.1password.com/command-line-getting-started/">1Password CLI - Getting Started</a></li>
<li><a href="https://docs.github.com/en/authentication/managing-commit-signature-verification">GitHub Docs - Commit signature</a></li>
<li><a href="https://stackoverflow.com/questions/38384957/prevent-git-from-asking-for-the-gnupg-password-during-signing-a-commit">Stackoverflow - Prevent git from asking for the GnuPG password during signing a commit</a></li>
</ul>

Reading session #4 (2021-12-20)<h2 id="articles">Articles</h2>
<ul>
<li><a href="#how-ai-can-improve-products-for-people-with-impaired-speech">How AI can improve products for people with impaired speech</a></li>
<li><a href="#unlocking-human-rights-information-with-machine-learning">Unlocking human rights information with machine learning</a></li>
<li><a href="#a-9b-ai-failure-examined">A $9B AI Failure, Examined</a></li>
<li><a href="#machine-learning-model-development-and-model-operations-principles-and-practices">Machine Learning Model Development and Model Operations: Principles and Practices</a></li>
<li><a href="#avoid-these-mistakes-with-time-series-forecasting">Avoid These Mistakes with Time Series Forecasting</a></li>
</ul>
<h2 id="how-ai-can-improve-products-for-people-with-impaired-speech">How AI can improve products for people with impaired speech</h2>
<p>Source:</p>
<ul>
<li><a href="https://blog.google/outreach-initiatives/accessibility/impaired-speech-recognition/">blog.google.com</a></li>
<li><a href="https://www.als.net/news/als-tdi-and-google-collaborate-to-bring-ai-to-als/">als.net</a></li>
</ul>
<blockquote>
<p>Some have recorded hundreds or thousands of specific phrases in order to train and optimize Google’s AI-based algorithms</p>
</blockquote>
<p>Voice recognition models are built from thousands of speech recordings, but none of them are from people with speech impairments. ALS TDI joined forces with Google and recruited people with ALS to record thousands of voice samples. By training their voice recognition models on those recordings, they managed to improve the models’ recognition of impaired speech. They do not provide any specific number on the accuracy improvement.</p>
<p>Once again, data is the key in those deep learning models, but it is also nice to see it working!</p>
<h2 id="unlocking-human-rights-information-with-machine-learning">Unlocking human rights information with machine learning</h2>
<p>Source: <a href="https://www.blog.google/outreach-initiatives/google-org/unlocking-human-rights-information-with-machine-learning/">blog.google.com</a></p>
<blockquote>
<p>they’ve built new tools that can automatically tag human rights documents so they are searchable — making the curation process 13 times faster</p>
</blockquote>
<p>Surveying the evolution of human rights across the globe is a challenging and time-consuming task! HURIDOCS built several models to help process corpora of human rights information, extracting and classifying relevant data.</p>
<p>In addition to winning the Peace and Justice Strong Institutions Award at the 2021 edition of CogX, they also offer a key tool as open source: <a href="https://huridocs.org/technology/uwazi/">Uwazi</a> that allows human rights defenders to store, organize and search through collections of human rights information.</p>
<h2 id="a-9b-ai-failure-examined">A $9B AI Failure, Examined</h2>
<p>Source: <a href="https://www.linkedin.com/pulse/9b-ai-fail-gianluca-mauro/">linkedin.com</a></p>
<p>This has already made the news multiple times: Zillow, an online real estate marketplace, lost hundreds of millions of dollars in addition to their stock going down, all because of a poor usage of an ML model.</p>
<p>Basically, one of their businesses was to estimate house prices and buy houses from their owners in the hope of making a profit on the future sale. For that they used an ML algorithm, which presumably came from the Kaggle competition they conducted. Sadly, the real estate market is not so simple. The article points out a few key issues:</p>
<ul>
<li>The real estate market is not stable and is extremely subject to external effects (cf. Covid-19)</li>
<li>Zillow wanted to renovate homes before selling them. However, what happens if instead of selling a home in 2 months, you have to wait 6, 12 or even more because of delays…</li>
<li>You have to think upfront about who is willing to sell their home fast, directly to Zillow. No visit of the property before buying?!</li>
<li>This one is very interesting: if your model is right 95% of the time, then 2.5% of the time it will be under the right price, and it is safe to assume most people won’t sell you those houses. You end up buying houses either at the right price or too high.</li>
<li>Sometimes the smallest change makes all the difference, and AI models can’t encode them all. For example, a difference of a few street numbers can drastically change the price.</li>
<li>The last one is golden: a senior DS job offer focusing on Facebook Prophet library skills. This might show that the ML management behind Zillow’s algorithm may not have had its focus on the right things.</li>
</ul>
<h2 id="machine-learning-model-development-and-model-operations-principles-and-practices">Machine Learning Model Development and Model Operations: Principles and Practices</h2>
<p>Source: <a href="https://www.kdnuggets.com/2021/10/machine-learning-model-development-operations-principles-practice.html">kdnuggets.com</a></p>
<p>This article summarizes all the steps of deploying an ML model in production, including key parts such as model performance monitoring and model version management. I think this article reveals one key issue as of today: the number of steps and tools can be overwhelming. You can go the simple way, with custom Python code and some DevOps principles, but that either will not scale very well or will require a lot of effort and resources to grow, and as the codebase grows it becomes harder for several people to work on the same project. Cloud platform solutions can then help, making complex tools easily available, but at the cost of being more tightly coupled to their platform.</p>
<h2 id="avoid-these-mistakes-with-time-series-forecasting">Avoid These Mistakes with Time Series Forecasting</h2>
<p>Source: <a href="https://www.kdnuggets.com/2021/12/avoid-mistakes-time-series-forecasting.html">kdnuggets.com</a></p>
<p>Sometimes we want to quickly check that our data is anything but random, so we generate random samples and compare simple metrics to see if there is any significant difference. However, if we don’t pick the right distribution or random generator, it is easy to get fooled.</p>
<p>In this example, they make the point that market stock price time series shouldn’t be compared to random samples drawn from a normal distribution, but to random walk generated numbers.</p>
<p>In the same way, it is easy to stop as soon as we identify any relevant difference. Such is the case when they compare the differenced time series, which seem to be different, but once you compare their autocorrelations, this is not the case anymore.</p>
<p>It is easy to fall into the “you only find what you are looking for” pit, and one must be careful and prefer a consistent approach to data analysis.</p>

Python and Poetry on Docker (2021-11-15)<p>Build a multi-stage Docker image from official Python images with support for Poetry projects.</p>
<p><a href="https://github.com/bmaingret/coach-planner">Source code on Github</a></p>
<!--more-->
<p>Updated following <a href="https://github.com/bmaingret/coach-planner/issues">issues</a> on the GitHub repository.</p>
<p>Sources:</p>
<ul>
<li><a href="https://pythonspeed.com/articles/base-image-python-docker-images/">Python=>Speed</a></li>
<li><a href="https://hub.docker.com/_/python">Python images on docker.com</a></li>
<li><a href="https://github.com/michaeloliverx/python-poetry-docker-example/blob/master/docker/Dockerfile">Dockerfile on github.com/michaeloliverx</a></li>
<li><a href="https://www.mktr.ai/the-data-scientists-quick-guide-to-dockerfiles-with-examples">Dockerfiles on mktr.ai</a></li>
<li><a href="https://github.com/python-poetry/poetry/discussions/1879">Discussions on Github Poetry</a></li>
</ul>
<h2 id="summary">Summary</h2>
<ul>
<li><a href="#multi-stage-build">Multi-stage build</a></li>
<li><a href="#choosing-a-base-version">Choosing a base version</a></li>
<li><a href="#stage-staging">Stage: Staging</a>
<ul>
<li><a href="#arg-and-environment-variables">ARG and environment variables</a></li>
<li><a href="#install-poetry">Install Poetry</a></li>
<li><a href="#source-file-and-dependencies">Source file and dependencies</a></li>
</ul>
</li>
<li><a href="#stage-development">Stage: Development</a>
<ul>
<li><a href="#install-our-project">Install our project</a></li>
<li><a href="#flask-webserver-and-entrypoint">Flask webserver and entrypoint</a></li>
</ul>
</li>
<li><a href="#stage-build">Stage: Build</a></li>
<li><a href="#stage-production">Stage: Production</a>
<ul>
<li><a href="#environment-variables">Environment variables</a></li>
<li><a href="#installating-our-application">Installating our application</a></li>
<li><a href="#entrypoint">Entrypoint</a></li>
</ul>
</li>
<li><a href="#build-our-image-and-use-it">Build our image and use it!</a>
<ul>
<li><a href="#production-image">Production image</a></li>
<li><a href="#development-image">Development image</a></li>
</ul>
</li>
</ul>
<h2 id="multi-stage-build">Multi-stage build</h2>
<p>A multi-stage build allows you to:</p>
<ul>
<li>stop at a specific step of a build</li>
<li>start from a different base, thus beginning a new stage of the build</li>
<li>pass artifacts from one stage to another</li>
</ul>
<p>I started this while looking for the best way to use both Docker and Poetry, and stumbled upon a quite complete Dockerfile at <a href="https://github.com/michaeloliverx/python-poetry-docker-example/blob/master/docker/Dockerfile">github.com/michaeloliverx/python-poetry-docker-example</a>. The author uses a multi-stage build to offer several images for development, testing, linting and production. Stopping at a specific image minimizes the image size and build time for each step. However, I was not completely happy with this example, which was a bit too complex for me, and I wanted to dig in anyway.</p>
<p>In our case we will have the following stages:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">staging</code>: installs Poetry and copies the relevant source files</li>
<li><code class="language-plaintext highlighter-rouge">development</code>: installs our project in editable mode</li>
<li><code class="language-plaintext highlighter-rouge">build</code>: builds our project into a wheel file</li>
<li><code class="language-plaintext highlighter-rouge">production</code>: a clean Python image that installs the built wheel</li>
</ul>
<h2 id="choosing-a-base-version">Choosing a base version</h2>
<p>To start off our image we need to choose the base image, with two obvious options:</p>
<ol>
<li>Official Linux images (Ubuntu, Debian, RHEL)</li>
<li>Official Python images</li>
</ol>
<p>I discarded the first one as you don’t always have the most recent Python versions, and I am not so worried about performance differences pointed out by <a href="https://pythonspeed.com/articles/base-image-python-docker-images/">Python=>Speed</a>.</p>
<p>Regarding the <a href="https://hub.docker.com/_/python">official Python images on docker.com</a>, we still have to choose the tag (i.e. flavor) we want:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;(-&lt;debian-codename&gt;)</code>: based on a specific (or the latest) Debian version, with common packages installed</li>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;-slim</code>: based on a specific (or the latest) Debian version, with only the strict requirements for a working Python environment</li>
<li><code class="language-plaintext highlighter-rouge">python:&lt;version&gt;-alpine</code>: discarded because, although much smaller, it brings complexity specific to the Alpine distribution</li>
</ol>
<p>I chose option 1, the recommended default, which includes the build tools that are needed anyway.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:3.10.0 as python-base</span>
</code></pre></div></div>
<h2 id="stage-staging">Stage: Staging</h2>
<h3 id="arg-and-environment-variables">ARG and environment variables</h3>
<p>We set up a few environment variables for Python, Pip and Poetry configurations.</p>
<p>A few things to keep in mind:</p>
<ul>
<li>Dockerfile doesn’t support referencing previously defined environment variables within the same <code class="language-plaintext highlighter-rouge">ENV</code> instruction.</li>
<li><code class="language-plaintext highlighter-rouge">ARG</code>s defined before a stage can be used inside it by referencing them again in the stage.</li>
<li>When using a different base image than the previous stages, <code class="language-plaintext highlighter-rouge">ENV</code> variables won’t be defined anymore. (seems obvious once said…)</li>
</ul>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ARG</span><span class="s"> APP_NAME=coach_planner</span>
<span class="k">ARG</span><span class="s"> APP_PATH=/opt/$APP_NAME</span>
<span class="k">ARG</span><span class="s"> PYTHON_VERSION=3.10.0</span>
<span class="k">ARG</span><span class="s"> POETRY_VERSION=1.1.11</span>
</code></pre></div></div>
<ol>
<li>The Python process will run only once in the container, so we don’t need to write the compiled Python files (*.pyc) to disk</li>
<li>Make sure Python outputs are sent straight to the terminal</li>
<li>Make sure Python tracebacks are dumped (even on segfaults, for instance)</li>
<li>No need for the pip cache in the Docker image</li>
</ol>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ENV</span><span class="s"> \</span>
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1
<span class="k">ENV</span><span class="s"> \</span>
POETRY_VERSION=$POETRY_VERSION \
POETRY_HOME="/opt/poetry" \
POETRY_VIRTUALENVS_IN_PROJECT=true \
POETRY_NO_INTERACTION=1
</code></pre></div></div>
<ol>
<li>We won’t update the pip version in any case</li>
<li>The default timeout is only 15 seconds</li>
<li>Pin the Poetry version</li>
<li>Instead of /root</li>
<li>Make sure the <code class="language-plaintext highlighter-rouge">.venv</code> directory will be in the build directory</li>
<li>No prompts from Poetry</li>
<li>Paths for the building stages</li>
<li>Add the virtual environment to the path in a separate <code class="language-plaintext highlighter-rouge">ENV</code> line to be able to use previously defined environment variables.</li>
<li>Update the path with the Poetry and virtual env paths.</li>
</ol>
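<p>Only items 3 to 6 of this list appear in the snippet above. The remaining items could look like the following sketch (the pip values mirror those used in the production stage later in this post; the path variable names are illustrative):</p>

```dockerfile
# Items 1-2: pip settings
ENV \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100
# Items 7-9: paths for the building stages, then a PATH update in a
# separate ENV line so it can reference the variables defined above
ENV \
    APP_PATH=/opt/$APP_NAME \
    VENV_PATH=/opt/$APP_NAME/.venv
ENV PATH="$POETRY_HOME/bin:$VENV_PATH/bin:$PATH"
```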
<h3 id="install-poetry">Install Poetry</h3>
<p>The installation follows Poetry’s official documentation and makes use of the new install script supporting the upcoming Poetry version. We need to update our <code class="language-plaintext highlighter-rouge">PATH</code> to be able to use Poetry afterwards.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install Poetry - respects $POETRY_VERSION & $POETRY_HOME</span>
<span class="k">RUN </span>curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python
<span class="k">ENV</span><span class="s"> PATH="$POETRY_HOME/bin:$PATH"</span>
</code></pre></div></div>
<h3 id="source-file-and-dependencies">Source file and dependencies</h3>
<p>To copy our source files, we make the assumption that the directory structure follows the one poetry would create:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry-demo
├── pyproject.toml
├── poetry_demo
│ └── __init__.py
</code></pre></div></div>
<p>We also obviously copy the <code class="language-plaintext highlighter-rouge">poetry.lock</code> file.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">COPY</span><span class="s"> ./poetry.lock ./pyproject.toml ./</span>
<span class="k">COPY</span><span class="s"> ./$APP_NAME ./$APP_NAME</span>
</code></pre></div></div>
<h2 id="stage-development">Stage: Development</h2>
<p>Make sure to specify the <code class="language-plaintext highlighter-rouge">ARG</code> we need in this stage.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> staging as development</span>
<span class="k">ARG</span><span class="s"> APP_NAME</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
</code></pre></div></div>
<h3 id="install-our-project">Install our project</h3>
<p>Nothing fancy here: we make use of the <code class="language-plaintext highlighter-rouge">poetry install</code> command.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">RUN </span>poetry <span class="nb">install</span>
</code></pre></div></div>
<h3 id="flask-webserver-and-entrypoint">Flask webserver and entrypoint</h3>
<p>In development mode we use the default Flask webserver. We first define a few Flask-related environment variables, and then set up the entrypoint so that the <code class="language-plaintext highlighter-rouge">flask run</code> command runs in the activated virtual environment of the project. This has several advantages:</p>
<ul>
<li>no direct <code class="language-plaintext highlighter-rouge">path</code> manipulation</li>
<li>expresses the intended use of this stage</li>
<li>documents how to use this stage</li>
<li>easy command override without bothering with the virtual environment, e.g. <code class="language-plaintext highlighter-rouge">docker run -it poetry flask shell</code>.</li>
</ul>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ENV</span><span class="s"> FLASK_APP=$APP_NAME \</span>
FLASK_ENV=development \
FLASK_RUN_HOST=0.0.0.0 \
FLASK_RUN_PORT=8888
<span class="k">ENTRYPOINT</span><span class="s"> ["poetry", "run"]</span>
<span class="k">CMD</span><span class="s"> ["flask", "run"]</span>
</code></pre></div></div>
<p>We can still access a shell by overriding the entrypoint, but that shouldn’t be the most common use case imho. Note that in this case one should activate the environment.</p>
<h2 id="stage-build">Stage: Build</h2>
<p>We first use the <code class="language-plaintext highlighter-rouge">poetry build</code> command, and add the <code class="language-plaintext highlighter-rouge">--format wheel</code> parameter to only build the wheel.</p>
<p>Then we use <code class="language-plaintext highlighter-rouge">poetry export</code> to get a file containing the dependency version constraints for our future pip installation. We pass the <code class="language-plaintext highlighter-rouge">--without-hashes</code> flag, but this could be removed to take advantage of <code class="language-plaintext highlighter-rouge">pip install --require-hashes</code>.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> staging as build</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
<span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">RUN </span>poetry build <span class="nt">--format</span> wheel
<span class="k">RUN </span>poetry <span class="nb">export</span> <span class="nt">--format</span> requirements.txt <span class="nt">--output</span> constraints.txt <span class="nt">--without-hashes</span>
</code></pre></div></div>
<h2 id="stage-production">Stage: Production</h2>
<p>For our production, we will start from a clean python image, and install our freshly built application.</p>
<h3 id="environment-variables">Environment variables</h3>
<p>We redefine some Python-related environment variables (required since we start from a fresh image), and add some directly related to pip. Note that again we reference our <code class="language-plaintext highlighter-rouge">ARG</code>s to be able to use them in this stage.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:$PYTHON_VERSION as production</span>
<span class="k">ARG</span><span class="s"> APP_NAME</span>
<span class="k">ARG</span><span class="s"> APP_PATH</span>
<span class="k">ENV</span><span class="s"> \</span>
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1
<span class="k">ENV</span><span class="s"> \</span>
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on \
PIP_DEFAULT_TIMEOUT=100
</code></pre></div></div>
<h3 id="installating-our-application">Installing our application</h3>
<p>We first retrieve the packaged application and the constraints file from the <code class="language-plaintext highlighter-rouge">build</code> stage using the <code class="language-plaintext highlighter-rouge">--from</code> flag of the <code class="language-plaintext highlighter-rouge">COPY</code> instruction. Then we proceed with the installation using pip.</p>
<p>Note that we make use of wildcards, but we could also pin the application version, similarly to the Python and Poetry versions, to get a more deterministic install.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WORKDIR</span><span class="s"> $APP_PATH</span>
<span class="k">COPY</span><span class="s"> --from=build $APP_PATH/dist/*.whl ./</span>
<span class="k">COPY</span><span class="s"> --from=build $APP_PATH/constraints.txt ./</span>
<span class="k">RUN </span>pip <span class="nb">install</span> ./<span class="nv">$APP_NAME</span><span class="k">*</span>.whl <span class="nt">--constraint</span> constraints.txt
</code></pre></div></div>
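<p>A more deterministic variant could pin the application version with an extra <code class="language-plaintext highlighter-rouge">ARG</code>; this is only a sketch, and the <code class="language-plaintext highlighter-rouge">APP_VERSION</code> value is hypothetical:</p>

```dockerfile
ARG APP_VERSION=0.1.0
WORKDIR $APP_PATH
# Wheel filenames follow <name>-<version>-<tags>.whl, so pinning the
# version narrows the wildcard down to a single expected file
COPY --from=build $APP_PATH/dist/$APP_NAME-$APP_VERSION-*.whl ./
COPY --from=build $APP_PATH/constraints.txt ./
RUN pip install ./$APP_NAME-$APP_VERSION-*.whl --constraint constraints.txt
```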
<h3 id="entrypoint">Entrypoint</h3>
<p>We will use <code class="language-plaintext highlighter-rouge">gunicorn</code> to serve our application in production.</p>
<p>We define two environment variables that will be used in the <code class="language-plaintext highlighter-rouge">gunicorn</code> command, allowing them to be overridden (mostly the <code class="language-plaintext highlighter-rouge">PORT</code>). This can be used when deploying on GCP Cloud Run, for instance.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># gunicorn port. Naming is consistent with GCP Cloud Run</span>
<span class="k">ENV</span><span class="s"> PORT=8888 </span>
<span class="c"># export APP_NAME as environment variable for the CMD</span>
<span class="k">ENV</span><span class="s"> APP_NAME=$APP_NAME</span>
</code></pre></div></div>
<p>The difference between <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> and <code class="language-plaintext highlighter-rouge">CMD</code> can be confusing, especially when they are used together, and I would recommend reading the <a href="https://docs.docker.com/engine/reference/builder/#understand-how-cmd-and-entrypoint-interact">Docker documentation on the topic</a>. In our case we need shell variable substitution for the environment variables, which limits our choices. An alternative would be to use a <code class="language-plaintext highlighter-rouge">config.py</code> file.</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">COPY</span><span class="s"> ./docker/docker-entrypoint.sh /docker-entrypoint.sh # 1</span>
<span class="k">RUN </span><span class="nb">chmod</span> +x /docker-entrypoint.sh <span class="c"># 1</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/docker-entrypoint.sh"] # 2</span>
<span class="k">CMD</span><span class="s"> ["gunicorn", "--bind :$PORT", "--workers 1", "--threads 1", "--timeout 0", "\"$APP_NAME:create_app()\""] # 3</span>
</code></pre></div></div>
<ol>
<li>Get the entrypoint script (see below) and make it executable</li>
<li>With this syntax, this is equivalent to <code class="language-plaintext highlighter-rouge">exec /docker-entrypoint.sh</code></li>
<li>These arguments will be passed to the entrypoint script and can be overridden by the arguments passed to <code class="language-plaintext highlighter-rouge">docker run</code></li>
</ol>
<p>And the script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span> <span class="c"># 1 </span>
<span class="nb">eval</span> <span class="s2">"exec </span><span class="nv">$@</span><span class="s2">"</span> <span class="c"># 2</span>
</code></pre></div></div>
<ol>
<li>Exit the script if any error occurs</li>
<li>Expand the arguments passed to the entrypoint script in the shell before passing them to <code class="language-plaintext highlighter-rouge">exec</code>, so that it supports passing environment variables.</li>
</ol>
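<p>The effect of that <code class="language-plaintext highlighter-rouge">eval</code>/<code class="language-plaintext highlighter-rouge">exec</code> combination can be reproduced outside Docker. In this sketch the inner <code class="language-plaintext highlighter-rouge">sh -c</code> plays the role of the entrypoint script and <code class="language-plaintext highlighter-rouge">echo</code> stands in for the gunicorn command (all names and values here are purely illustrative):</p>

```shell
#!/bin/sh
# Environment variable as provided by e.g. `docker run --env PORT=...`
PORT=5555
export PORT
# The CMD arguments reach the entrypoint as "$@" with a literal $PORT inside;
# `eval "exec $@"` expands $PORT in the shell before exec-ing the command
result=$(sh -c 'eval "exec $@"' entrypoint echo 'binding to :$PORT')
echo "$result"
```

Without the <code class="language-plaintext highlighter-rouge">eval</code>, the literal string <code class="language-plaintext highlighter-rouge">$PORT</code> would be passed through unexpanded.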
<h2 id="build-our-image-and-use-it">Build our image and use it!</h2>
<h3 id="production-image">Production image</h3>
<p>Let’s build our image and try using it.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker build <span class="nt">--tag</span> poetry <span class="nt">--file</span> docker/Dockerfile <span class="nb">.</span>
<span class="go">
[+] Building 33.3s (17/17) FINISHED
</span><span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build definition from Dockerfile 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring dockerfile: 2.13kB 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load .dockerignore 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 2B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load metadata <span class="k">for </span>docker.io/library/python:3.10.0 1.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build context 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 332B 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 1/5] FROM docker.io/library/python:3.10.0@sha256:bb797f045026352 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> resolve docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab376c5666 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>production 2/6] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 2/5] RUN curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poetry/poe 22.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 3/5] WORKDIR /opt/coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 4/5] COPY ./poetry.lock ./pyproject.toml ./ 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 5/5] COPY ./coach_planner ./coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>build 1/2] WORKDIR /opt/coach_planner 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>build 2/2] RUN poetry build <span class="nt">--format</span> wheel 2.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 3/6] COPY <span class="nt">--from</span><span class="o">=</span>build /opt/coach_planner/dist/<span class="k">*</span>.whl ./ 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 4/6] RUN pip <span class="nb">install</span> ./coach_planner<span class="k">*</span>.whl 3.5s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 5/6] COPY ./docker/docker-entrypoint.sh /docker-entrypoint.sh 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>production 6/6] RUN <span class="nb">chmod</span> +x /docker-entrypoint.sh 0.6s
<span class="gp"> =></span><span class="w"> </span>exporting to image 0.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> exporting layers 0.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> writing image sha256:a612b9cdb91bacb1cefd1393318a5d15238034183fd48dbbeafffdd7 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> naming to docker.io/library/poetry 0.0s
</code></pre></div></div>
<p>Note that only the <code class="language-plaintext highlighter-rouge">staging</code>, <code class="language-plaintext highlighter-rouge">build</code>, and <code class="language-plaintext highlighter-rouge">production</code> stages appear here. Since the <code class="language-plaintext highlighter-rouge">development</code> stage is not required for the final (i.e. default) stage, it is not even built.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry
<span class="go">
[2021-11-15 18:17:59 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-11-15 18:17:59 +0000] [1] [INFO] Listening at: http://0.0.0.0:8888 (1)
[2021-11-15 18:17:59 +0000] [1] [INFO] Using worker: sync
[2021-11-15 18:17:59 +0000] [11] [INFO] Booting worker with pid: 11
</span></code></pre></div></div>
<p>Let’s change our gunicorn binding port:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">--env</span> <span class="nv">PORT</span><span class="o">=</span>5555 <span class="nt">-it</span> poetry
<span class="go">
[2021-11-17 18:21:28 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-11-17 18:21:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:5555 (1)
[2021-11-17 18:21:28 +0000] [1] [INFO] Using worker: sync
[2021-11-17 18:21:28 +0000] [7] [INFO] Booting worker with pid: 7
</span></code></pre></div></div>
<h3 id="development-image">Development image</h3>
<p>Now let’s use our development stage:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker build <span class="nt">--target</span> development <span class="nt">-t</span> poetry <span class="nt">--file</span> docker/Dockerfile <span class="nb">.</span>
<span class="go">
[+] Building 1.4s (12/12) FINISHED
</span><span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build definition from Dockerfile 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring dockerfile: 38B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load .dockerignore 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 2B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load metadata <span class="k">for </span>docker.io/library/python:3.10.0 1.2s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>internal] load build context 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> transferring context: 260B 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">[</span>staging 1/5] FROM docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> resolve docker.io/library/python:3.10.0@sha256:bb797f045026352ece65ab376c5666 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 2/5] RUN curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/python-poet 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 3/5] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 4/5] COPY ./poetry.lock ./pyproject.toml ./ 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>staging 5/5] COPY ./coach_planner ./coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>development 1/2] WORKDIR /opt/coach_planner 0.0s
<span class="gp"> =></span><span class="w"> </span>CACHED <span class="o">[</span>development 2/2] RUN poetry <span class="nb">install </span>0.0s
<span class="gp"> =></span><span class="w"> </span>exporting to image 0.1s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> exporting layers 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> writing image sha256:cae979092d3b1e2c833e96f2e9acdfbd6980609f36d907a6547f05f1 0.0s
<span class="gp"> =></span><span class="w"> </span><span class="o">=></span> naming to docker.io/library/poetry 0.0s
</code></pre></div></div>
<p>Here only the <code class="language-plaintext highlighter-rouge">staging</code> and <code class="language-plaintext highlighter-rouge">development</code> stages are run!</p>
<p>And using it:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry
<span class="go">
* Serving Flask app 'coach_planner' (lazy loading)
* Environment: development
* Debug mode: on
* Running on all addresses.
WARNING: This is a development server. Do not use it in a production deployment.
* Running on http://172.17.0.2:8888/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 774-185-805
</span></code></pre></div></div>
<p>Or overriding the default <code class="language-plaintext highlighter-rouge">CMD</code> and getting a Python shell:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">-it</span> poetry python
<span class="go">
Python 3.10.0 (default, Oct 26 2021, 22:20:53) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
</span><span class="gp">></span><span class="o">>></span>
</code></pre></div></div>
<p>Or even overriding the default <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> and then starting a poetry shell:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>docker run <span class="nt">-p</span> 8888:8888 <span class="nt">--entrypoint</span> /bin/bash <span class="nt">-it</span> poetry
<span class="go">
</span><span class="gp">root@4dd4bb68f510:/opt/coach_planner#</span><span class="w"> </span>poetry shell
<span class="go">Spawning shell within /opt/coach_planner/.venv
</span><span class="gp">root@4dd4bb68f510:/opt/coach_planner#</span><span class="w"> </span><span class="nb">.</span> /opt/coach_planner/.venv/bin/activate
<span class="gp">(.venv) root@4dd4bb68f510:/opt/coach_planner#</span><span class="w">
</span></code></pre></div></div>Baptiste MaingretBuild a multi-stage Docker image from official Python images with support for Poetry projects. Source code on GithubReading notes for Architecture Patterns with Python by Harry Percival, Bob Gregory2021-08-13T00:00:00+02:002021-08-13T00:00:00+02:00https://bmaingret.github.io/blog/architecture-patterns-with-python<p>Notes, references and code I wrote while reading and coding along <a href="https://www.oreilly.com/library/view/architecture-patterns-with/9781492052197/"><code class="language-plaintext highlighter-rouge">Architecture Patterns with Python</code> by Harry Percival, Bob Gregory - O’Reilly</a>.</p>
<p><strong>Work in progress</strong></p>
<!--more-->
<p>My code-along repository: <a href="https://github.com/bmaingret/architecture-patterns-code-along">bmaingret/architecture-patterns-code-along</a>.</p>
<p>The full code by the authors is also available, as well as their book, at <a href="https://github.com/cosmicpython/">github.com/cosmicpython</a>.</p>
<ul>
<li><a href="#chapter-1---domain-model">Chapter 1 - Domain model</a>
<ul>
<li><a href="#diving-in-the-domain-model">Diving in the domain model</a></li>
<li><a href="#value-object-pattern">Value Object Pattern</a></li>
<li><a href="#domain-entity">Domain entity</a></li>
<li><a href="#not-everything-must-be-in-a-class">Not everything must be in a class</a></li>
<li><a href="#exceptions-as-domain-concepts">Exceptions as domain concepts</a></li>
</ul>
</li>
<li><a href="#chapter-2---repository-pattern">Chapter 2 - Repository Pattern</a>
<ul>
<li><a href="#repository-pattern">Repository pattern</a></li>
<li><a href="#port-and-adapter">Port and Adapter</a></li>
<li><a href="#orm">ORM</a></li>
</ul>
</li>
<li><a href="#interlude---reproducibility-and-continuous-integration">Interlude - Reproducibility and Continuous Integration</a>
<ul>
<li><a href="#development-and-production-environment---python-mess">Development and Production Environment - Python mess</a></li>
<li><a href="#github-actions">Github Actions</a></li>
<li><a href="#git-hooks---pre-commit">Git hooks - Pre-commit</a></li>
</ul>
</li>
<li><a href="#chapter-3---coupling-and-abstractions">Chapter 3 - Coupling and Abstractions</a></li>
<li><a href="#chapter-4---service-layer-pattern">Chapter 4 - Service Layer pattern</a></li>
<li><a href="#chapter-5---tdd-in-high-gear-and-low-gear">Chapter 5 - TDD in High Gear and Low Gear</a></li>
<li><a href="#chapter-6---unit-of-work">Chapter 6 - Unit Of Work</a></li>
<li><a href="#chapter-7---aggregate-and-consistency-boundaries">Chapter 7 - Aggregate and Consistency Boundaries</a>
<ul>
<li><a href="#handling-concurrency">Handling concurrency</a></li>
</ul>
</li>
<li><a href="#chapter-8---events-and-the-message-bus">Chapter 8 - Events and the Message Bus</a></li>
</ul>
<h2 id="chapter-1---domain-model">Chapter 1 - Domain model</h2>
<blockquote>
<p>The domain is a fancy way of saying the problem you’re trying to solve</p>
</blockquote>
<blockquote>
<p>The domain model is the mental map that business owners have of their businesses</p>
</blockquote>
<h3 id="diving-in-the-domain-model">Diving in the domain model</h3>
<ul>
<li>Understand the business jargon and keeps a glossary</li>
<li>Get concrete examples of the rules defining the domain model</li>
<li>TDD: translates those rules into unit tests</li>
</ul>
<h3 id="value-object-pattern">Value Object Pattern</h3>
<blockquote>
<p>any domain object that is uniquely identified by the data it holds; we usually make them immutable</p>
</blockquote>
<p>In Python this can easily be translated to a frozen dataclass, offering hashing for free.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">OrderLine</span><span class="p">:</span>
<span class="n">order_reference</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">sku</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">quantity</span><span class="p">:</span> <span class="nb">int</span>
</code></pre></div></div>
<p>Note however that because SQLAlchemy modifies the class at runtime, we have to use <code class="language-plaintext highlighter-rouge">unsafe_hash=True</code> instead…</p>
<h3 id="domain-entity">Domain entity</h3>
<blockquote>
<p>Domain object that has long-lived identity</p>
</blockquote>
<p>Entities are usually mutable and have a fixed identity that does not depend on their values.</p>
<blockquote>
<p>We usually make this explicit in code by implementing equality operators on entities:</p>
</blockquote>
<p>In Python that means defining the <code class="language-plaintext highlighter-rouge">__eq__</code> operator.</p>
<p>Be careful when defining hash: it basically means identifying what uniquely defines an entity throughout its life.</p>
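<p>A minimal sketch of such an entity (the <code class="language-plaintext highlighter-rouge">Batch</code> name follows the book’s running example; the fields here are illustrative):</p>

```python
class Batch:
    """Domain entity: identified by its reference, not by its values."""

    def __init__(self, reference: str, sku: str, quantity: int):
        self.reference = reference
        self.sku = sku
        self.quantity = quantity  # mutable state, irrelevant to identity

    def __eq__(self, other):
        if not isinstance(other, Batch):
            return False
        return self.reference == other.reference

    def __hash__(self):
        # Hash only on the immutable identity, never on mutable state
        return hash(self.reference)
```

Two batches with the same reference compare equal even when their mutable state differs.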
<h3 id="not-everything-must-be-in-a-class">Not everything must be in a class</h3>
<p>Domain Service Function: in Python we can simply put the function at module level, without making it more than that.</p>
<h3 id="exceptions-as-domain-concepts">Exceptions as domain concepts</h3>
<p>Business errors can be nicely represented by exceptions</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OutOfStock</span><span class="p">(</span><span class="nb">Exception</span><span class="p">):</span>
<span class="k">pass</span>
</code></pre></div></div>
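<p>A module-level domain service function can then raise such an exception; this sketch is purely illustrative (the book’s actual <code class="language-plaintext highlighter-rouge">allocate</code> signature differs):</p>

```python
class OutOfStock(Exception):
    pass


def allocate(quantity: int, stock: int) -> int:
    """Module-level domain service function: no class wrapper needed."""
    if quantity > stock:
        # The business rule "cannot allocate more than available"
        # surfaces as a domain exception
        raise OutOfStock(f"Cannot allocate {quantity} units, only {stock} in stock")
    return stock - quantity
```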
<h2 id="chapter-2---repository-pattern">Chapter 2 - Repository Pattern</h2>
<h3 id="repository-pattern">Repository pattern</h3>
<p>The Repository pattern makes an abstraction around the storage: everything looks like it is stored in-memory, which decouples the domain model from the details of the persistence layer.</p>
<h3 id="port-and-adapter">Port and Adapter</h3>
<p>Port usually is some interface, and adapter its implementation. In Python, this usually translates to some abstract base class and its implementation, but it can also be an implicit duck type port.</p>
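<p>A sketch of that translation (the <code class="language-plaintext highlighter-rouge">AbstractRepository</code> naming follows the book’s conventions; the dict-backed fake is illustrative):</p>

```python
import abc


class AbstractRepository(abc.ABC):
    """The port: an interface the rest of the code depends on."""

    @abc.abstractmethod
    def add(self, obj):
        raise NotImplementedError

    @abc.abstractmethod
    def get(self, reference):
        raise NotImplementedError


class FakeRepository(AbstractRepository):
    """An adapter: a dict-backed implementation, handy for tests."""

    def __init__(self):
        self._store = {}

    def add(self, obj):
        self._store[obj.reference] = obj

    def get(self, reference):
        return self._store[reference]
```

A real adapter backed by SQLAlchemy would implement the same two methods against a database session.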
<h3 id="orm">ORM</h3>
<p>An ORM can lead to a strong dependency on the ORM framework, and one must be careful to invert the dependency and make the ORM depend on the domain models instead.</p>
<h2 id="interlude---reproducibility-and-continuous-integration">Interlude - Reproducibility and Continuous Integration</h2>
<p>Although I had read a lot and knew this was the way to go, I never took the time to implement it in any of my projects, so I thought this would be a good opportunity, even more so since the original authors take this path as well (Makefile and Docker).</p>
<h3 id="development-and-production-environment---python-mess">Development and Production Environment - Python mess</h3>
<blockquote>
<p>I mostly followed <a href="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry">https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry</a></p>
<p>I had so much trouble making everything work under plain Windows, that I moved all my Python dev to WSL2.</p>
</blockquote>
<p><img src="https://imgs.xkcd.com/comics/python_environment.png" alt="Had to" /></p>
<blockquote>
<p>https://xkcd.com/1987</p>
</blockquote>
<p>I had learned of Poetry a year ago, but had still stuck with Conda. I never liked the Conda way of handling dependencies, but it is still of great help to install some data science tools that are not pure Python.</p>
<ul>
<li>Install Conda (I prefer to use the <a href="https://docs.conda.io/en/latest/miniconda.html">Miniconda installer</a>)</li>
<li>Install Poetry (through the <a href="https://python-poetry.org/docs/master/#installation">install-poetry.py script</a>). N.B. this requires a working installation of Python, so if you only installed it with Conda, install it through a shell with a Conda environment active.</li>
<li>Create a minimal Conda environment (<code class="language-plaintext highlighter-rouge">conda create --name remote-work-env python=3.8.5</code> or <code class="language-plaintext highlighter-rouge">conda create --file environment.yml</code>)</li>
<li>Set up a new project/Init an existing project using Poetry (e.g. <code class="language-plaintext highlighter-rouge">poetry init</code> from within an existing directory). N.B. Poetry should pick up the active Conda environment and not create a new one.</li>
<li>Manage dependencies with Poetry (<code class="language-plaintext highlighter-rouge">poetry add sqlalchemy</code>, <code class="language-plaintext highlighter-rouge">poetry add --dev pytest</code>)</li>
</ul>
<h3 id="github-actions">Github Actions</h3>
<p>GitHub Actions are a recent (and welcome) addition to GitHub, allowing CI/CD workflows right in GitHub. Although it is possible to stay in the GitHub ecosystem, some <em>Actions</em> still depend on external tools and APIs.</p>
<p>To get a good starting example, go to the Actions tab of a GitHub repository and, at the <em>Choose a workflow template</em> step, select <em>Skip this and set up a workflow yourself</em>; an editable configuration template will open right in the browser.</p>
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_github_actions.png" alt="" /></p>
<p>For this repository, I have a single workflow covering test runs, test coverage, linting and formatting. The Python setup is done thanks to an available GitHub Action, dependency installation thanks to Poetry, coverage through pytest-cov and <a href="https://app.codecov.io/gh/bmaingret/architecture-patterns-code-along">codecov</a>, and finally linting/formatting thanks to <a href="https://results.pre-commit.ci/repo/github/395353648">pre-commit.ci</a>. More on that in the next part on pre-commit.</p>
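<p>As an indicative sketch only (the action versions and step names are my assumptions, not copied from the actual repository), such a workflow could look like this:</p>

```yaml
# .github/workflows/ci.yml -- indicative sketch, not the repository's actual file
name: CI
on: [push, pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install poetry && poetry install
      - run: poetry run pytest --cov --cov-report=xml
      - uses: codecov/codecov-action@v2
```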
<h3 id="git-hooks---pre-commit">Git hooks - Pre-commit</h3>
<p>Instead of (or in addition to) automating code quality checks before merging, we can take advantage of git hooks to run these checks before even committing. Thanks to some good people, we have an awesome Python tool for that: <a href="https://pre-commit.com">pre-commit.com</a>. Once installed, configuration is done through a <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> file, and the hooks are installed with <code class="language-plaintext highlighter-rouge">pre-commit install</code>. Once <em>installed</em>, you can see the generated Python script in <code class="language-plaintext highlighter-rouge">.git/hooks/pre-commit</code>.</p>
<p>To configure hooks with pre-commit, you need to specify git repositories. Several hooks are available at the <a href="https://github.com/pre-commit/pre-commit-hooks">pre-commit repository</a>, among which I configured:</p>
<ul>
<li>check-yaml</li>
<li>check-toml</li>
<li>end-of-file-fixer</li>
<li>trailing-whitespace</li>
</ul>
<p>To this I added <a href="https://github.com/PyCQA/flake8">Flake8</a>, a Python linter, and <a href="https://github.com/psf/black">black</a>, a code formatter (this one will modify your code; you can configure it to only check, but that would defeat the point).</p>
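<p>A configuration matching the hooks above could look roughly like this (the <code class="language-plaintext highlighter-rouge">rev</code> pins are indicative; <code class="language-plaintext highlighter-rouge">pre-commit autoupdate</code> will set them for you):</p>

```yaml
# .pre-commit-config.yaml -- sketch covering the hooks discussed above
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    hooks:
      - id: check-yaml
      - id: check-toml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
  - repo: https://github.com/psf/black
    rev: 22.1.0
    hooks:
      - id: black
```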
<p>Note that thanks to <a href="https://pre-commit.ci">pre-commit.ci</a>, the same configuration file can be used both to install the hooks locally, and to run checks in a Github Action.</p>
<h2 id="chapter-3---coupling-and-abstractions">Chapter 3 - Coupling and Abstractions</h2>
<blockquote>
<p>reduce the degree of coupling within a system by abstracting away the details</p>
</blockquote>
<p>Some key takeaways:</p>
<ul>
<li>Abstractions and decoupling help for testing (c.f. the repository pattern).</li>
<li>Separate the core logic code from external states.</li>
</ul>
<p>This usually allows <em>edge-to-edge</em> testing, faking some details (quite often I/O). It requires some additional abstractions (around the filesystem, for instance) and new explicit dependencies on these abstractions.</p>
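<p>A minimal sketch of the idea (the function names are mine, not the book's): the core logic takes the I/O as an explicit, injectable dependency, so an edge-to-edge test can swap in a fake and never touch the disk.</p>

```python
import os

def count_large_files(list_sizes, threshold=1024):
    """Core logic: a pure function over an injected size-listing callable."""
    return sum(1 for size in list_sizes() if size > threshold)

def real_list_sizes(directory="."):
    """The I/O detail, kept behind the same callable interface."""
    return [
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    ]

# Edge-to-edge test: fake the filesystem with a plain callable
fake_sizes = lambda: [100, 2048, 4096]
assert count_large_files(fake_sizes) == 2
```

<p>In production code you would pass <code class="language-plaintext highlighter-rouge">real_list_sizes</code>; the test only depends on the abstraction.</p>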
<h2 id="chapter-4---service-layer-pattern">Chapter 4 - Service Layer pattern</h2>
<blockquote>
<p>Also called an <em>orchestration layer</em> or a <em>use-case layer</em>.</p>
</blockquote>
<ul>
<li>Service layer exposes the domain service functionalities through endpoints to the external world.</li>
<li>It wraps the boring stuff such as validating input, calling the domain model and updating it, and finally persisting everything</li>
<li>Interacting with our domain model becomes easier and allows for different types of interactions (CLI, web, etc.)</li>
<li>It eases the high-level and end-to-end tests, allowing for fewer tests and easier refactoring of the underlying domain models</li>
</ul>
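<p>A sketch of such a service function (a toy domain of my own, not the book's allocation model): it validates the input, orchestrates the domain call, and persists through a repository.</p>

```python
class InvalidSku(Exception):
    pass

def allocate(stock, sku, qty):
    """Toy domain logic: decrement the stock for a SKU."""
    stock[sku] -= qty
    return stock[sku]

def allocate_service(sku, qty, repo):
    """Service-layer function: validate, call the domain, persist."""
    stock = repo.get()
    if sku not in stock:                   # validate the entry
        raise InvalidSku(f"Invalid sku {sku}")
    remaining = allocate(stock, sku, qty)  # call and update the domain model
    repo.save(stock)                       # persist the result
    return remaining

class FakeRepository:
    """In-memory stand-in for a real repository; also handy in tests."""
    def __init__(self, stock):
        self._stock = stock
    def get(self):
        return dict(self._stock)
    def save(self, stock):
        self._stock = stock

repo = FakeRepository({"LAMP": 10})
assert allocate_service("LAMP", 3, repo) == 7
```

<p>A CLI, a Flask view or a test can all call <code class="language-plaintext highlighter-rouge">allocate_service</code> the same way; only the repository changes.</p>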
<p>Although the concept of a service layer is interesting, this chapter leaves things in a dubious state: there is still a lot of coupling towards the ORM from both the Flask app and the services, the low-level details of the domain implementation are everywhere, and tests are getting harder and harder to initialize properly.</p>
<h2 id="chapter-5---tdd-in-high-gear-and-low-gear">Chapter 5 - TDD in High Gear and Low Gear</h2>
<p>An analogy is made with biking, where you start in a low gear (unit tests) and then shift towards higher gears (e2e tests). This lets you hide implementation details further and have tests that are less coupled to those details.</p>
<p>To reduce coupling with domain models:</p>
<ul>
<li>Fixture functions to help initialize domain models</li>
<li>Adding services that will handle the domain models</li>
</ul>
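<p>The first bullet can be as simple as a factory helper (a hypothetical sketch; with pytest these would typically become <code class="language-plaintext highlighter-rouge">@pytest.fixture</code> functions): tests build domain objects through one helper, so a change to a model's constructor only touches one place.</p>

```python
def make_batch_and_line(sku, batch_qty, line_qty):
    """Fixture-style factory: tests never call model constructors directly,
    so a constructor change only touches this helper."""
    batch = {"sku": sku, "available": batch_qty}
    line = {"sku": sku, "qty": line_qty}
    return batch, line

def can_allocate(batch, line):
    """Toy domain rule used by the test below."""
    return batch["sku"] == line["sku"] and batch["available"] >= line["qty"]

batch, line = make_batch_and_line("CHAIR", 20, 2)
assert can_allocate(batch, line)
```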
<p>I’ll just copy/paste from the book here for rules of thumb regarding tests to implement:</p>
<blockquote>
<ul>
<li>
<p>Aim for one end-to-end test per feature</p>
</li>
<li>
<p>Write the bulk of your tests against the service layer (edge-to-edge)</p>
</li>
<li>
<p>Maintain a small core of tests written against your domain model (maintain is the important word here: start with a lot and delete once they are covered by services)</p>
</li>
<li>
<p>Error handling counts as a feature</p>
</li>
<li>
<p>Express your service layer in terms of primitives rather than domain objects.</p>
</li>
</ul>
</blockquote>
<h2 id="chapter-6---unit-of-work">Chapter 6 - Unit Of Work</h2>
<p>Services and the API are still tightly coupled with data persistence and session management. The Unit of Work defines a single entry point for data storage, allowing transactions to be handled cleanly (commits, rollbacks, failures, etc.). It also eases the integration between the service and repository layers.</p>
<p>In Python, it maps very well onto the context manager protocol.</p>
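<p>A bare-bones sketch of a Unit of Work as a context manager (illustrative and in-memory; the book's version wraps a database session): leaving the <code class="language-plaintext highlighter-rouge">with</code> block on an exception rolls back, and nothing is persisted without an explicit <code class="language-plaintext highlighter-rouge">commit()</code>.</p>

```python
class FakeUnitOfWork:
    """Minimal Unit of Work as a context manager (illustrative names)."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.rollback()   # any exception inside the block triggers a rollback
        return False          # don't swallow the exception

    def commit(self):
        self.committed = True

    def rollback(self):
        self.rolled_back = True

uow = FakeUnitOfWork()
with uow:
    uow.commit()              # explicit commit: nothing persists by default
assert uow.committed and not uow.rolled_back
```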
<h2 id="chapter-7---aggregate-and-consistency-boundaries">Chapter 7 - Aggregate and Consistency Boundaries</h2>
<blockquote>
<p>Invariants are conditions that are always true.</p>
</blockquote>
<blockquote>
<p>Constraints are rules that restrict the possible states of the model</p>
</blockquote>
<p>In order to ensure invariants and constraints, in addition to the logic behind them, we need to ensure data integrity, especially under concurrent operations. While we could lock the entire table/database we are manipulating, this won’t scale. The <strong>aggregate pattern</strong> groups several domain objects into a container and allows manipulating them as a single entity, thus ensuring the data integrity and consistency of everything in it (the actual implementation ensures this, not the mere use of the pattern).</p>
<p>The choice of aggregates is not simple and depends on the constraints of each project. Keep in mind that the less data there is in the domain models, the easier it is to ensure invariants and constraints.</p>
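<p>To make the idea concrete, a drastically simplified sketch (the names loosely echo the book's product/batch domain, but the structures here are mine): the aggregate is the single entry point to its internal objects, so the invariant lives in one place.</p>

```python
class Product:
    """Aggregate sketch: all access to batches goes through the Product,
    so the invariant "allocate to the earliest batch with enough stock"
    is enforced in a single place."""

    def __init__(self, sku, batches):
        self.sku = sku
        self.batches = batches       # internal objects, never touched directly
        self.version_number = 0      # handy later for optimistic concurrency

    def allocate(self, qty):
        batch = next(
            b for b in sorted(self.batches, key=lambda b: b["eta"])
            if b["available"] >= qty
        )
        batch["available"] -= qty
        self.version_number += 1
        return batch["ref"]

product = Product(
    "LAMP",
    [
        {"ref": "shipment", "eta": 2, "available": 10},
        {"ref": "in-stock", "eta": 1, "available": 5},
    ],
)
assert product.allocate(8) == "shipment"   # the in-stock batch is too small
assert product.version_number == 1
```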
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_recap_part1.png" alt="From Cosmic Python" /></p>
<blockquote>
<p><a href="https://github.com/cosmicpython/book/raw/master/images/apwp_0705.png">Direct link @Cosmic Python</a></p>
</blockquote>
<h3 id="handling-concurrency">Handling concurrency</h3>
<p><strong>Optimistic concurrency</strong>: assume things work fine most of the time</p>
<ul>
<li>Locking <em>things</em> at the db level usually comes with a performance cost -> usually used in a <em>pessimistic concurrency</em> mindset</li>
<li>Using version numbers to control updates and to be able to detect and recover from concurrent ones</li>
</ul>
<p>Note: there are a lot of db-specific ways of implementing locking at different levels (consistent reads, select for update, etc.)</p>
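<p>The version-number idea can be sketched with a plain in-memory store (illustrative only; a real implementation performs this compare-and-bump inside a database transaction):</p>

```python
class ConcurrentUpdate(Exception):
    pass

def update_if_unchanged(store, key, expected_version, new_value):
    """Optimistic concurrency sketch: write only if nobody bumped the
    version since we read it; otherwise the caller retries."""
    current_version, _ = store[key]
    if current_version != expected_version:
        raise ConcurrentUpdate(f"{key} was modified concurrently")
    store[key] = (current_version + 1, new_value)

store = {"LAMP": (3, 10)}           # version 3, stock 10
update_if_unchanged(store, "LAMP", expected_version=3, new_value=7)
assert store["LAMP"] == (4, 7)

try:                                # a stale writer sees a version conflict
    update_if_unchanged(store, "LAMP", expected_version=3, new_value=9)
except ConcurrentUpdate:
    pass
```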
<h2 id="chapter-8---events-and-the-message-bus">Chapter 8 - Events and the Message Bus</h2>
<p><em>Events</em> help enforce the Single Responsibility Principle (the <em>S</em> of <em>SOLID</em>), preventing multiple use cases from being tangled in a single place.</p>
<p>The <em>message bus</em> routes event messages to their different handlers. Typical middleware.</p>
<p>Events can be raised and handled at different places:</p>
<ul>
<li>The service layer takes events raised by the models and sends them straight to the message bus</li>
<li>The service layer raises events directly on the message bus</li>
<li>The UoW collects events from aggregates and sends them to the message bus</li>
</ul>
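<p>A minimal message bus can be a mapping from event type to handlers (an illustrative sketch of the pattern, not the book's exact implementation):</p>

```python
class OutOfStock:
    """A plain event: just data describing something that happened."""
    def __init__(self, sku):
        self.sku = sku

notifications = []

def notify_purchasing(event):
    notifications.append(f"Out of stock: {event.sku}")

# The bus routes each event type to its list of handlers
HANDLERS = {OutOfStock: [notify_purchasing]}

def handle(event):
    for handler in HANDLERS[type(event)]:
        handler(event)

handle(OutOfStock("LAMP"))
assert notifications == ["Out of stock: LAMP"]
```

<p>Adding a use case then means registering one more handler, not editing existing code.</p>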
<p><img src="/assets/2021-08-13-architecture-patterns-with-python/2021-08-13-architecture-patterns-with-python_events.png" alt="" /></p>
<blockquote>
<p><a href="https://github.com/cosmicpython/book/blob/master/images/apwp_0801.png">Direct link @CosmicPython</a></p>
</blockquote>
<h2 id="chapter-9---going-to-town-on-the-message-bus">Chapter 9 - Going to Town on the Message Bus</h2>
<blockquote>
<p>Using the message bus as an entry point for the service layer.</p>
</blockquote>
<ul>
<li>allows us to be granular and stick to the SRP</li>
<li>allows us to write tests in terms of events.</li>
</ul>
<p>Service functions become event handlers, and as such all internal and external actions are managed the same way: through event handlers.</p>
<p>When large changes are incoming:</p>
<blockquote>
<p>follow the Preparatory Refactoring workflow, aka “Make the change easy; then make the easy change”</p>
</blockquote>Baptiste MaingretNotes, references and codes I wrote while reading and coding along Architecture Patterns with Python by Harry Percival, Bob Gregory - O’Reilly. Work in progressReading session #32021-04-21T00:00:00+02:002021-04-21T00:00:00+02:00https://bmaingret.github.io/blog/reading-session-3<h2 id="articles">Articles</h2>
<ul>
<li><a href="#python-in-visual-studio-code--april-2021-release">Python in Visual Studio Code – April 2021 Release</a></li>
<li><a href="#is-manual-etl-better-than-no-code-etl-are-etl-tools-dead">Is manual ETL better than No-Code ETL: Are ETL tools dead?</a></li>
<li><a href="#the-explosion-of-roles-in-data-science">The Explosion of Roles in Data Science</a></li>
<li><a href="#the-sexiest-job-of-the-21st-century-isnt-sexy-anymore">The Sexiest Job of the 21st Century Isn’t “Sexy” Anymore</a></li>
<li><a href="#data-scientist-vs-machine-learning-engineer-skills-heres-the-difference">Data Scientist vs Machine Learning Engineer Skills. Here’s the Difference.</a></li>
<li><a href="#10-tips-and-tricks-for-data-scientists-vol4">10 Tips and Tricks for Data Scientists Vol.4</a></li>
</ul>
<!-- -->
<h2 id="python-in-visual-studio-code--april-2021-release">Python in Visual Studio Code – April 2021 Release</h2>
<p>Source: <a href="https://devblogs.microsoft.com/python/python-in-visual-studio-code-april-2021-release/">devblogs.microsoft.com</a></p>
<p>I have been using VS Code for 2 years now, mostly for Python, including notebooks, but also for any text file.</p>
<h3 id="support-for-poetry-environments">Support for <a href="https://python-poetry.org">Poetry</a> environments</h3>
<p>Never used it, but it will probably be the next thing I try if I ever move away from conda. It ditches the setup.cfg and setup.py files in favor of the PEP 518 pyproject.toml file. It also lets you distinguish between what you explicitly wanted to install and all the dependencies that got installed along the way!</p>
<h3 id="better-auto-completions-for-pytorch-using-pylance">Better auto-completions for Pytorch using Pylance</h3>
<p>Never had the opportunity to face these issues since so far I have used only TensorFlow and it was in Jupyter notebook where I have found auto-completion to be a bit clumsy.</p>
<h3 id="data-viewer-enhancements">Data Viewer Enhancements</h3>
<p>The data viewer is one of the reasons to ditch Jupyter notebooks and use VS Code. Since I had spent some time on RStudio before moving to Python for data science, this data viewer was one of the features I was missing.</p>
<p>Enhancements listed:</p>
<ul>
<li>Ability to refresh the data viewer</li>
<li>Support for PyTorch and TensorFlow Tensor data types</li>
<li>Visual update</li>
<li>Ability to slice data (huge!): easily see specific dimensions of high-dimensional data</li>
</ul>
<h2 id="is-manual-etl-better-than-no-code-etl-are-etl-tools-dead">Is manual ETL better than No-Code ETL: Are ETL tools dead?</h2>
<p>Source: <a href="https://www.analyticsvidhya.com/blog/2021/04/is-manual-etl-better-than-no-code-etl-are-etl-tools-dead/">analyticsvidhya.com</a></p>
<p>Tries to pit GUI tools against pure-code tools for ETL purposes, which I think doesn’t make much sense in most cases. Most GUI tools give you a way to script some parts, for custom-tailored transformations for instance. As for pure code for ETL, it can quickly become a challenging and human-resource-intensive process. In the end it usually comes down to the knowledge and know-how of the teams at work.</p>
<h2 id="the-explosion-of-roles-in-data-science">The Explosion of Roles in Data Science</h2>
<p>Source: <a href="https://towardsdatascience.com/the-explosion-of-roles-in-data-science-5963aa83e1c">towardsdatascience.com</a></p>
<blockquote>
<p>We have data scientists, data analysts, data engineers, machine learning engineers, analytics engineers, business intelligence engineers, data architects, data storytellers…</p>
</blockquote>
<p>How to overcome the overwhelming effect of so many different roles to choose from? Especially when job offers usually mix everything together and you have little or no prior experience in any of these roles?</p>
<ul>
<li>“You are not your role”: don’t limit yourself to the job title; roles overlap more often than not, and you are not tied to a job name</li>
<li>“Focus on abilities rather than on roles”: as said above, it is often unclear what exactly lies under each role, but abilities stay the same. Some companies might think a Data Engineer is a Database Admin. So? Don’t focus on the role but on the abilities you’ll develop: manipulating databases and SQL.</li>
<li>“Keep learning, keep improving”: some companies focus too much on what people know at the instant they want to recruit them. Considering the pace at which Data Science is evolving, I think it is fair to accept people who have the fundamental abilities required for DS/ML and who keep on learning.</li>
</ul>
<h2 id="the-sexiest-job-of-the-21st-century-isnt-sexy-anymore">The Sexiest Job of the 21st Century Isn’t “Sexy” Anymore</h2>
<p>Source: <a href="https://medium.com/illumination/the-sexiest-job-of-the-21st-century-isnt-sexy-anymore-fd5335a5d4d4">medium.com/illumination</a></p>
<ul>
<li>“#1 People Doesn’t Know What Actually Is Data Science”: be it people wanting to get into DS or people trying to recruit data scientists.</li>
<li>“#2 Expectation vs. Reality — Here Lies A Wide, Wide Gap!”: People getting into DS think they’ll work each week on a new project with cool new tech or algorithm…</li>
<li>“#3 Lack of Upskilling for Data Science Professionals”: things are moving so fast it can get hard to be an expert in anything, which is what companies are looking for</li>
<li>“#5 People Aren’t Willing To Wait”: It is a long road to become a proficient data scientist…</li>
</ul>
<h2 id="data-scientist-vs-machine-learning-engineer-skills-heres-the-difference">Data Scientist vs Machine Learning Engineer Skills. Here’s the Difference.</h2>
<p>Source: <a href="https://towardsdatascience.com/data-scientist-vs-machine-learning-engineer-skills-heres-the-difference-93eb2f4f6f98">towardsdatascience.com</a></p>
<p>This article seems just wrong to me:</p>
<ul>
<li>“a machine learning engineer does not necessarily need to know how random forest works, but they need to know how to save and load a file automatically”.</li>
<li>“If you can master these three base skills, you will be well on your way to being a great data scientist”. Skills being Python, Jupyter and SQL. If that’s all you require from a data scientist, see just above.</li>
</ul>
<h2 id="10-tips-and-tricks-for-data-scientists-vol4">10 Tips and Tricks for Data Scientists Vol.4</h2>
<p>Source: <a href="https://www.r-bloggers.com/2021/04/10-tips-and-tricks-for-data-scientists-vol-4/">r-bloggers.com</a></p>
<ul>
<li>You can get Google Drive data directly into Google Colab</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from google.colab import drive
drive.mount('/content/gdrive')
</code></pre></div></div>
<ul>
<li>Reading/Writing Pandas DF directly as GZip (use <code class="language-plaintext highlighter-rouge">compression='gzip'</code>)</li>
</ul>
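<p>The second tip in a quick sketch (the file name is arbitrary): pandas writes and reads gzip-compressed CSV directly, and would even infer the compression from a <code class="language-plaintext highlighter-rouge">.gz</code> extension if left unspecified.</p>

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write and read back a gzip-compressed CSV
df.to_csv("data.csv.gz", index=False, compression="gzip")
roundtrip = pd.read_csv("data.csv.gz", compression="gzip")
assert roundtrip.equals(df)
```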
<p>Others can be checked directly but were not of much interest for me.</p>Baptiste MaingretReading session #22021-03-21T00:00:00+01:002021-03-21T00:00:00+01:00https://bmaingret.github.io/blog/reading-session-2<h2 id="articles">Articles</h2>
<ul>
<li><a href="#bringing-machine-learning-models-into-production-without-effort-at-dailymotion">Bringing Machine Learning models into production without effort at Dailymotion</a></li>
<li><a href="#how-to-run-a-python-script-using-a-docker-container">How To Run a Python Script Using a Docker Container</a></li>
<li><a href="#how-to-build-a-dag-factory-on-airflow">How to build a DAG Factory on Airflow</a></li>
<li><a href="#deploying-machine-learning-into-production-dont-do-labs">Deploying Machine Learning into Production: Don’t do Labs.</a></li>
<li><a href="#mlops-is-changing-how-machine-learning-models-are-developed">MLOps Is Changing How Machine Learning Models Are Developed</a></li>
</ul>
<!-- -->
<h2 id="bringing-machine-learning-models-into-production-without-effort-at-dailymotion">Bringing Machine Learning models into production without effort at Dailymotion</h2>
<p>Source: <a href="https://medium.com/dailymotion/bring-machine-learning-models-faster-to-production-with-airflow-and-kubernetes-e9d47ca3bee5">medium.com/dailymotion</a></p>
<blockquote>
<p>How we manage to schedule Machine Learning pipelines seamlessly with Airflow and Kubernetes using KubernetesPodOperator</p>
</blockquote>
<p><img src="/assets/2021-00-00-reading-sessions/dailymotion_bring-machine-learning-models.png" alt="Data Scientists versus Data Engineers" /></p>
<p><em>Life cycle of a machine learning model - Dailymotion (c)</em></p>
<p>They followed what I would call a classic evolution:</p>
<ul>
<li>First a containerized approach on an always-up VM-like instance with a simple cron-like schedule</li>
<li>Then on-demand VM-like instantiation to run the container</li>
<li>Finally, node pools of different types allowing on-demand containers to run and share resources</li>
</ul>
<p>All of this thanks to the good integration between Airflow and Kubernetes (c.f. KubernetesPodOperator).</p>
<p>In the article they mention <a href="https://www.kubeflow.org/docs/about/kubeflow/">Kubeflow</a>, an open source project started at Google and designed specifically for Machine Learning workflows on Kubernetes.</p>
<h2 id="how-to-run-a-python-script-using-a-docker-container">How To Run a Python Script Using a Docker Container</h2>
<p>Source: <a href="https://towardsdatascience.com/how-to-mount-a-directory-inside-a-docker-container-4cee379c298b">towardsdatascience.com</a></p>
<p>A very simple introduction to setting up a Docker image with the required tools and software. Although I find Docker very attractive for reusability, it does require you to declare everything explicitly, which in my typical conda-based working environment can be a pain if I don’t want to add too much useless overhead.</p>
<h2 id="how-to-build-a-dag-factory-on-airflow">How to build a DAG Factory on Airflow</h2>
<p>Source: <a href="https://towardsdatascience.com/how-to-build-a-dag-factory-on-airflow-9a19ab84084c">towardsdatascience.com</a></p>
<p>I ran into similar concerns when trying out Jenkins at work, and there were very few examples of how to set things up properly. In addition, since I am more of a configuration-over-code type of guy, I wanted to take advantage of the recently introduced Jenkins Pipelines. Trying to set up the two while being the only knowledgeable person on the topic seemed like too much work and an unnecessary SPOF for our needs.</p>
<p>In this case the final result is appealing, but it still seems a bit of a hack (it assumes Python files only and relies on a hardcoded heuristic DAG-detection rule), and I’d rather have a tool such as Airflow provide a common interface for this.</p>
<h2 id="deploying-machine-learning-into-production-dont-do-labs">Deploying Machine Learning into Production: Don’t do Labs.</h2>
<p>Source: <a href="https://towardsdatascience.com/deploying-machine-learning-into-production-dont-do-labs-7dd35576da3f">towardsdatascience.com</a></p>
<p>Although I was aware that a large majority of data science projects don’t make it to production (87% claimed here), this article states that this is not due to the lack of value for the models but more to the difficulty to scale.</p>
<ul>
<li>Gap between data scientists and data engineers</li>
<li>Development in isolated environments away from user interaction, production constraints and businesses</li>
<li>ML adds additional metrics to monitor that can be either difficult to implement or to foresee (e.g. gender bias)</li>
</ul>
<p>I thought this article would describe how to industrialize the development environment to match production, but it actually makes the point that this is not enough. They put data scientists in the existing product teams, where they work as additional resources. This allows them to work hands-on with the people responsible for the end product, and the product owner to make enlightened decisions on where to put effort.</p>
<p>Some advantages of working in <em>labs</em> that are yet to be replicated in their new paradigm:</p>
<blockquote>
<p>System Thinking</p>
<ul>
<li>Allows getting out of the team routine and approaching each problem with a fresh perspective</li>
</ul>
</blockquote>
<blockquote>
<p>Ideal Design</p>
<ul>
<li>Don’t limit yourself to what seems possible instead of what would be ideal</li>
</ul>
</blockquote>
<blockquote>
<p>Always be Innovating</p>
<ul>
<li>Use cutting-edge tools and solutions without limitations on what will be possible. Explore and test without having to think about profitability and production efficiency.</li>
</ul>
</blockquote>
<h2 id="mlops-is-changing-how-machine-learning-models-are-developed">MLOps Is Changing How Machine Learning Models Are Developed</h2>
<p>Source: <a href="https://www.kdnuggets.com/2020/12/mlops-changing-machine-learning-developed.html">kdnuggets.com</a></p>
<p>MLOps has to be one of the most popular topics in the ML world today. A few key points are addressed here to show what moving from ML labs to proper production-ready ML implies.</p>
<blockquote>
<p>Version Control is Not Just for Code</p>
</blockquote>
<p>Data versioning is a big concern in ML. With ever larger data sets, typical versioning tools might not be applicable. In addition, GDPR-like concerns mean that data is usually more sensitive than the code base.</p>
<blockquote>
<p>Build Safeguards into the Code</p>
</blockquote>
<p>Checking the input data used for training, validation, etc. prevents big swings in your model trainings. Similarly, checking differences between the previous and new models allows early detection of issues (this could be done by checking prediction differences element by element).</p>
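<p>A sketch of that last safeguard (the names and the threshold are mine, purely illustrative): compare the two models’ predictions element by element and refuse to promote the new model if they diverge more than expected.</p>

```python
def prediction_drift(old_preds, new_preds):
    """Fraction of examples on which the two models disagree."""
    disagreements = sum(o != n for o, n in zip(old_preds, new_preds))
    return disagreements / len(old_preds)

def safe_to_promote(old_preds, new_preds, max_drift=0.1):
    """Safeguard: block promotion when predictions swing more than expected."""
    return prediction_drift(old_preds, new_preds) <= max_drift

old = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
new = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]   # one flipped prediction out of ten
assert safe_to_promote(old, new)                      # 10% drift, at the threshold
assert not safe_to_promote(old, new, max_drift=0.05)  # stricter budget rejects it
```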
<blockquote>
<p>The Pipeline is the Product – Not the Model</p>
</blockquote>
<blockquote>
<p>You give a poor man a fish and you feed him for a day. You teach him to fish and you give him an occupation that will feed him for a lifetime.”</p>
</blockquote>
<p>Only the whole pipeline allows for proper control over the models and in the end, positive value for the project.</p>
<p>I’d cross this with previous article on data scientists working in labs. Even a well designed and defined pipeline might not be production ready nor usable in real-world.</p>Baptiste MaingretReading session #12021-02-05T00:00:00+01:002021-02-05T00:00:00+01:00https://bmaingret.github.io/blog/reading-session-1<h2 id="articles">Articles</h2>
<ul>
<li><a href="#apache-superset">Apache Superset</a></li>
<li><a href="#create-a-devops-culture-with-open-source-principles">Create a DevOps culture with open source principles</a></li>
<li><a href="#google-recommits-to-the-python-ecosystem">Google recommits to the Python ecosystem</a></li>
<li><a href="#now-announcing-makefile-support-in-visual-studio-code">Now announcing: Makefile support in Visual Studio Code!</a></li>
<li><a href="#abracadabra-bringing-the-magics-to-xeus-python">Abracadabra! Bringing the magics to xeus-python</a></li>
</ul>
<!-- -->
<h2 id="apache-superset">Apache Superset</h2>
<p>Source: <a href="https://superset.apache.org/">superset.apache.org</a></p>
<blockquote>
<p>Apache Superset (Incubating) is a modern, enterprise-ready business intelligence web application.</p>
</blockquote>
<p><img src="/assets/2021-00-00-reading-sessions/superset2021-02-05%20180319.png" alt="Superset gallery" /></p>
<p>Some comments:</p>
<ul>
<li>“Enterprise-ready” might be a stretch judging from others’ experiences</li>
<li>It seems to me more appropriate for a single consolidated database (think data lake) than for multiple databases</li>
<li>Some advanced charts, but with simple/cleaned data. It really is a visualization tool; all preprocessing must be done beforehand.</li>
</ul>
<h2 id="create-a-devops-culture-with-open-source-principles">Create a DevOps culture with open source principles</h2>
<p>Source: <a href="https://opensource.com/article/20/12/remote-devops">opensource.com</a></p>
<p>I find this article to provide reasonable guidelines that are applicable to IT departments (and probably others) being remote or not.</p>
<p>Open source principles:</p>
<h3 id="community">Community</h3>
<blockquote>
<p>Being part of team goals can help people escape the stress of the home front.</p>
</blockquote>
<p>Nothing is worse than being home alone, facing an issue with no support.</p>
<h3 id="collaboration">Collaboration</h3>
<blockquote>
<p>Collaboration—during a pandemic or not—is about culture, not the latest tool or platform.</p>
</blockquote>
<p>Oftentimes people look for new tools to help them collaborate better, whereas it is the mindset that needs to change.</p>
<h3 id="transparency">Transparency</h3>
<blockquote>
<p>Remote DevOps teams benefit from centralizing access to project information and materials.</p>
</blockquote>
<p>Although I find that not every piece of information should be addressed directly to everyone, there is nothing worse than withholding information. I think everyone is able and should be allowed to comment and give one’s opinion.</p>
<h3 id="release-early-and-often">Release early and often</h3>
<blockquote>
<p>When a remote DevOps team releases early and often, they prove the remote work model’s validity and give stakeholders something real to see</p>
</blockquote>
<p>Although in some companies and industries it is not welcome to show up with unfinished projects, I find it rewarding and motivating for everyone. A good balance is necessary, since each release does bring additional work.</p>
<h3 id="pivot-and-refresh">Pivot and refresh</h3>
<blockquote>
<p>Just as you stop to correct software delivery issues, you need to start doing the same with communications and collaboration.</p>
</blockquote>
<p>When you find something did not happen as expected because of poor communication, don’t dwell on it; take it as an opportunity to make some changes.</p>
<h2 id="google-recommits-to-the-python-ecosystem">Google recommits to the Python ecosystem</h2>
<p>Source: <a href="https://sdtimes.com/softwaredev/google-recommits-to-the-python-ecosystem">sdtimes.com</a></p>
<p>Google cloud environment was my <a href="https://github.com/bmaingret/kaist-wst660-gae-app">first experience with Python</a> thanks to their Google App Engine.</p>
<p>It is always important for large companies to put a significant amount of support behind these technologies, considering how much they build on them, while being careful not to fall into situations similar to Oracle/Sun and the Oracle/Java ecosystem.</p>
<h2 id="now-announcing-makefile-support-in-visual-studio-code">Now announcing: Makefile support in Visual Studio Code!</h2>
<p>Source: <a href="https://devblogs.microsoft.com/cppblog/now-announcing-makefile-support-in-visual-studio-code">devblogs.microsoft.com</a></p>
<p>Although make and makefiles are more than 50 years old, and are not the new shiny tools, they can be of great help in data-science projects.</p>
<p>There are usually a lot of similar steps required to set up environments, run different stages (data ETL, model training, evaluation, etc.), manage cloud resources, and so on, all of which can be greatly sped up and made less error-prone thanks to makefiles.</p>
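<p>To make this concrete, here is a minimal makefile sketch for such a project. All target names, scripts and paths below are hypothetical:</p>

```make
# Illustrative makefile for a data-science project; every target, script
# and path name here is hypothetical. Recipe lines must start with a tab.
.PHONY: env data train clean

env:            ## create the virtual environment and install dependencies
	python -m venv .venv && .venv/bin/pip install -r requirements.txt

data:           ## run the data ETL step
	.venv/bin/python src/etl.py

train: data     ## train the model (re-runs the ETL step first if needed)
	.venv/bin/python src/train.py

clean:
	rm -rf .venv data/processed
```

<p>Running <code>make train</code> then chains the ETL and training steps in a single, repeatable command.</p>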
<p>Another article presenting some of it: <a href="https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c">medium.com/@davidstevens_16424</a>.</p>
<h2 id="abracadabra-bringing-the-magics-to-xeus-python">Abracadabra! Bringing the magics to xeus-python</h2>
<p>Source: <a href="https://blog.jupyter.org/abracadabra-bringing-the-magics-to-xeus-python-9d17bcfacb4">blog.jupyter.org</a></p>
<p>It is always interesting to read documentation on the behind-the-scenes of some of the tools we use. I find it greatly enhances the way I work with them, by getting a glimpse of how and why the tool was made as it is and where it is going.</p>
<p>This article presents the work on the <a href="https://blog.jupyter.org/a-new-python-kernel-for-jupyter-fcdf211e30a8">next Jupyter kernel</a>, partly based on Xeus, a C++ implementation of the Jupyter kernel protocol. And reading this, all of a sudden we can discover some of the components of what we usually call a Jupyter notebook…</p>Baptiste MaingretMotor Trend Car Road Tests (mtcars) datasets - Analysis and Regression2019-11-12T00:00:00+01:002019-11-12T00:00:00+01:00https://bmaingret.github.io/blog/Motor-Trend-Car-Road-Tests-Analysis-and-Regression<h2 id="motor-trend-car-road-tests-mtcars-datasets---analysis-and-regression">Motor Trend Car Road Tests (mtcars) datasets - Analysis and Regression</h2>
<p>This assignment was part of the Johns Hopkins Coursera module on
<a href="https://www.coursera.org/learn/regression-models">Regression Models</a> as
part of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science
Specialization</a>.</p>
<!--more-->
<p>Source code available on
<a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/07-regression-models/01-project">GitHub</a></p>
<h2 id="summary">Summary</h2>
<p>We want to answer these two questions:</p>
<ul>
<li>Is an automatic or manual transmission better for MPG?</li>
<li>Quantify the MPG difference between automatic and manual
transmissions.</li>
</ul>
<p>We compared the mean mpg for automatic and manual transmissions and
concluded that the difference in favor of manual transmission in terms
of mpg was significant. We then looked further at other variables to
explain the difference in mpg.</p>
<h2 id="look-at-the-data">Look at the data</h2>
<p>Glimpse at the data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## mpg cyl disp hp drat wt qsec vs am gear carb mean.mpg
## 1 21.0 6 160 110 3.90 2.620 16.46 v.shaped manual 4 4 20.09062
## 2 21.0 6 160 110 3.90 2.875 17.02 v.shaped manual 4 4 20.09062
## 3 22.8 4 108 93 3.85 2.320 18.61 straight manual 4 1 20.09062
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 v.shaped:18
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 straight:14
## Median :3.695 Median :3.325 Median :17.71
## Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :4.930 Max. :5.424 Max. :22.90
## am gear carb
## automatic:19 Min. :3.000 Min. :1.000
## manual :13 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
</code></pre></div></div>
<h2 id="mpg-difference-between-automatic-and-manual-transmission">MPG difference between automatic and manual transmission</h2>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-4-1.png" alt="" /><!-- --></p>
<p>Looking at the boxplot, we see a difference between the mpg of the
two transmission types.</p>
<p>We check normality, variance equality to see how we can conduct our test
(details in appendix), and then conducted a two-sided T-Test:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg.test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We have a p-value of 0.14% < 5%, and a confidence interval of [-11;
-3.2] (excluding 0) for the difference in mean mpg between automatic
and manual transmissions.</p>
<p>From the look of this, manual transmission allows for more mpg, with
roughly 7 more mpg on average.</p>
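<p>As a side note, the statistic computed by R’s <code>t.test(..., var.equal = FALSE)</code> (Welch’s t-test) can be sketched in plain Python. The samples below are made-up numbers for illustration, not the actual mtcars values:</p>

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    i.e. what R's t.test(..., var.equal = FALSE) computes internally."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# made-up samples with equal spread but shifted means
auto = [10, 12, 14, 16, 18]
manual = [20, 22, 24, 26, 28]
t, df = welch_t(auto, manual)  # t = -5.0, df = 8.0
```

<p>The p-value then comes from the t distribution with <code>df</code> degrees of freedom, which R handles for us.</p>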
<p>If we fit a simple linear model to our data we end up with similar
results as previously (an increase of roughly 7.2 mpg), and we can have
a look at the residual plot, which is almost normal (graphically
speaking) for automatic but not as much for manual. Looking at the
residuals against several other possible predictors, we can see some
linear trends (e.g. hp and wt).</p>
<h2 id="going-further">Going further</h2>
<p>Looking at the pair plot and correlation plot, we see that other
variables seem more correlated with mpg than am.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggpairs</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">am</span><span class="p">),</span><span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">11</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">progress</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">continuous</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrap</span><span class="p">(</span><span class="s2">"cor"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)))</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-6-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mtcars.cor</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">mtcars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">am</span><span class="o">=</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">am</span><span class="p">),</span><span class="w"> </span><span class="n">vs</span><span class="o">=</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">vs</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">mean.mpg</span><span class="p">)))</span><span class="w">
</span><span class="n">corrplot</span><span class="p">(</span><span class="n">mtcars.cor</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">,</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hclust"</span><span class="p">,</span><span class="w"> </span><span class="n">tl.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">tl.srt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-6-2.png" alt="" /><!-- --></p>
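<p>Each cell of the corrplot is simply a Pearson correlation coefficient, which is easy to compute by hand; a small Python sketch on made-up numbers:</p>

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# a perfectly linear relationship gives r = 1, a decreasing one r = -1
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # 1.0
r_neg = pearson([1, 2, 3], [3, 2, 1])        # -1.0
```

<p>Categorical columns such as am and vs first have to be coerced to numbers, which is what the <code>mutate(am=as.numeric(am), ...)</code> call above does.</p>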
<h3 id="adding-variables-to-our-model">Adding variables to our model</h3>
<p>We can try to add wt, cyl and disp, which seem to be relevant
candidates both from a mechanical point of view and from the corrplot.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rownames</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">fit2</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="o">~</span><span class="n">I</span><span class="p">(</span><span class="n">hp</span><span class="o">/</span><span class="m">10</span><span class="p">)</span><span class="o">+</span><span class="n">wt</span><span class="o">+</span><span class="n">cyl</span><span class="o">+</span><span class="n">disp</span><span class="o">+</span><span class="n">am</span><span class="p">,</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ I(hp/10) + wt + cyl + disp + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## I(hp/10) -0.27960 0.13922 -2.008 0.05510 .
## wt -3.30262 1.13364 -2.913 0.00726 **
## cyl -1.10638 0.67636 -1.636 0.11393
## disp 0.01226 0.01171 1.047 0.30472
## ammanual 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
</code></pre></div></div>
<p>Only weight and, marginally, horsepower seem significant; transmission type does not.</p>
<h3 id="modelling-withough-transmission-type">Modelling without transmission type</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit3</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="o">~</span><span class="n">I</span><span class="p">(</span><span class="n">hp</span><span class="o">/</span><span class="m">10</span><span class="p">)</span><span class="o">+</span><span class="n">wt</span><span class="o">+</span><span class="n">cyl</span><span class="o">+</span><span class="n">disp</span><span class="p">,</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ I(hp/10) + wt + cyl + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## I(hp/10) -0.20538 0.12147 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">anova</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="n">fit3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Analysis of Variance Table
##
## Model 1: mpg ~ I(hp/10) + wt + cyl + disp + am
## Model 2: mpg ~ I(hp/10) + wt + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 163.12
## 2 27 170.44 -1 -7.3245 1.1675 0.2898
</code></pre></div></div>
<p>We see we have similar R-squared, RSS and p-value while dropping the
transmission type.</p>
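<p>The F statistic in this table can be recomputed by hand from the residual sums of squares of the two nested models; a small Python sketch plugging in the values from the anova output above:</p>

```python
def nested_f(rss_full, df_full, rss_reduced, df_reduced):
    """F statistic for comparing two nested linear models,
    as computed by R's anova() on a pair of fits."""
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

# residual sums of squares and degrees of freedom from the table above
f = nested_f(rss_full=163.12, df_full=26, rss_reduced=170.44, df_reduced=27)
# f is about 1.167, matching the reported 1.1675 (the inputs are rounded)
```

<p>Since the F value is small (p ≈ 0.29), dropping <code>am</code> does not significantly worsen the fit.</p>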
<h3 id="automatic-model-selection">Automatic model selection</h3>
<p>Let’s try some automatic model selection to see what we could get.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fit the full model </span><span class="w">
</span><span class="n">full.model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="c1"># Stepwise regression model</span><span class="w">
</span><span class="n">step.model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stepAIC</span><span class="p">(</span><span class="n">full.model</span><span class="p">,</span><span class="w"> </span><span class="n">direction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"both"</span><span class="p">,</span><span class="w">
</span><span class="n">trace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">step.model</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = datasets::mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
</code></pre></div></div>
<p>We find again wt and am, which comforts our previous models. We also
have an additional variable that we did not explore before: qsec.</p>
<p>We can however argue that qsec is strongly correlated with horsepower
(and cylinders, displacement, etc.).</p>
<h3 id="some-pca">Some PCA</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"FactoMineR"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"factoextra"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res.pca</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">PCA</span><span class="p">(</span><span class="n">datasets</span><span class="o">::</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="n">scale.unit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ncp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">graph</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">fviz_pca_var</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">col.var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cos2"</span><span class="p">,</span><span class="w"> </span><span class="n">repel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_eig</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">addlabels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-2.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_contrib</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">choice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">top</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-3.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fviz_contrib</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w"> </span><span class="n">choice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">top</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-10-4.png" alt="" /><!-- --></p>
<h2 id="normality-and-variance">Normality and variance</h2>
<h3 id="normality-of-data">Normality of data</h3>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-11-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shapiro.test</span><span class="p">(</span><span class="n">manual</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Shapiro-Wilk normality test
##
## data: manual
## W = 0.9458, p-value = 0.5363
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shapiro.test</span><span class="p">(</span><span class="n">auto</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Shapiro-Wilk normality test
##
## data: auto
## W = 0.97677, p-value = 0.8987
</code></pre></div></div>
<h3 id="comparison-of-variance">Comparison of variance</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">var.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## F test to compare two variances
##
## data: auto and manual
## F = 0.38656, num df = 18, denom df = 12, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1243721 1.0703429
## sample estimates:
## ratio of variances
## 0.3865615
</code></pre></div></div>
<h3 id="t-test">T-Test</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg.test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">auto</span><span class="p">,</span><span class="w"> </span><span class="n">manual</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">mpg.test</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## Welch Two Sample t-test
##
## data: auto and manual
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
</code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-15-1.png" alt="" /><!-- --></p>
<h3 id="residual-plots">Residual plots</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit</span><span class="o"><-</span><span class="n">lm</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">mtcars</span><span class="p">)</span><span class="w">
</span><span class="n">qplot</span><span class="p">(</span><span class="n">residuals</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">mtcars</span><span class="o">$</span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'density'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-16-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mtcars</span><span class="o">$</span><span class="n">mpg.resid</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">residuals</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span><span class="w">
</span><span class="n">mtcars.gathered</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mtcars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="p">(</span><span class="n">am</span><span class="p">,</span><span class="w"> </span><span class="n">mpg.resid</span><span class="p">,</span><span class="w"> </span><span class="n">cyl</span><span class="p">,</span><span class="w"> </span><span class="n">disp</span><span class="p">,</span><span class="w"> </span><span class="n">hp</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="p">,</span><span class="w"> </span><span class="n">qsec</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate_if</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">gather</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">am</span><span class="p">,</span><span class="n">mpg.resid</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">mtcars.gathered</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mpg.resid</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">am</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">key</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Motor-Trend-Car-Road-Tests-Analysis-and-Regression/figure-gfm/unnamed-chunk-16-2.png" alt="" /><!-- --></p>Baptiste MaingretMotor Trend Car Road Tests (mtcars) datasets - Analysis and Regression This assignment was part of the Johns Hopkins Coursera module on Regression Models as part of the Data Science Specialization.Effect of Vitamin C on Tooth Growth in Guinea Pigs2019-10-30T00:00:00+01:002019-10-30T00:00:00+01:00https://bmaingret.github.io/blog/Effect-of-Vitamin-C-on-Tooth-Growth-in-Guinea-Pigs<h2 id="basic-inferential-data-analysis-on-toothgrowth-dataset-part-of-statistical-inference-by-johns-hopkins-university">Basic Inferential Data Analysis on ToothGrowth dataset (part of Statistical Inference by Johns Hopkins University)</h2>
<p>This assignment was part of the Johns Hopkins Coursera module on
<a href="https://www.coursera.org/learn/statistical-inference">Statistical
Inference</a> as part
of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science
Specialization</a>.</p>
<!--more-->
<p>Source code available on
<a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/06-statistical-inference/01-project">GitHub</a></p>
<h2 id="overview">Overview</h2>
<p>The goal is to conduct some simple hypothesis testing on the ToothGrowth
dataset available in the R datasets package.</p>
<p>Some assumptions:</p>
<ul>
<li>equal variances among groups</li>
<li>standard deviation estimated from the samples</li>
<li><img src="https://render.githubusercontent.com/render/math?math=\alpha" /> is set to 5%</li>
<li>samples are not paired</li>
</ul>
<h2 id="data-processing">Data processing</h2>
<p>We import the data and directly set the <em>dose</em> as a factor.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">datasets</span><span class="p">)</span><span class="w">
</span><span class="n">tg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">datasets</span><span class="o">::</span><span class="n">ToothGrowth</span><span class="w">
</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>A glimpse at the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="p">(</span><span class="n">tg</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary</span><span class="p">(</span><span class="n">tg</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
</code></pre></div></div>
<p>Density plots of tooth length, faceted by dose and delivery method:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="n">qplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">len</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">tg</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">dose</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dose</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">facets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dose</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">supp</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-30-Effect-of-Vitamin-C-on-Tooth-Growth-in-Guinea-Pigs/figure-gfm/unnamed-chunk-3-1.png" alt="" /><!-- --></p>
<h3 id="has-the-delivery-method-an-impact-on-tooth-growth">Does the delivery method have an impact on tooth growth?</h3>
<p>We will test against the null hypothesis that there is no
difference in means between the two groups.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">supp</span><span class="o">==</span><span class="s2">"OJ"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">supp</span><span class="o">==</span><span class="s2">"VC"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The p-value (6.0393371%) is larger than 5%, and in addition the
95% confidence interval (-0.1670064, 7.5670064) contains the value 0.
We therefore fail to reject the null hypothesis in this case.</p>
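<p>The figures quoted above can be read directly off the test object; for instance, reusing the <code class="language-plaintext highlighter-rouge">t.res</code> object computed in the previous chunk:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t.res$p.value   # p-value of the two-sided test, about 0.0604
t.res$conf.int  # 95% confidence interval for the difference in means
</code></pre></div></div>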
<h3 id="has-the-dose-an-impact-on-tooth-growth">Does the dose have an impact on tooth growth?</h3>
<p>We test the difference in means between each pair of dosages (3 tests: 0.5 vs 1, 0.5 vs 2, 1 vs 2).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"0.5"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"0.5"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tg</span><span class="p">[</span><span class="n">tg</span><span class="o">$</span><span class="n">dose</span><span class="o">==</span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"len"</span><span class="p">]</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">p.sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">t.res.c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t.test</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"two.sided"</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">paired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">var.equal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">p.res.c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">power.t.test</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">p.sd</span><span class="p">,</span><span class="w"> </span><span class="n">sig.level</span><span class="o">=</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"two.sample"</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="o">=</span><span class="s2">"two.sided"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## dose.0.5v1 dose0.5v2 dose.1v2
## p-value 1.266297e-07 2.837553e-14 1.810829e-05
## conf-interval-low -1.198375e+01 -1.815352e+01 -8.994387e+00
## conf-interval-up -6.276252e+00 -1.283648e+01 -3.735613e+00
## power 9.909607e-01 1.000000e+00 9.057799e-01
</code></pre></div></div>
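<p>The summary table above can be assembled from the three test and power objects. A minimal sketch, reusing the <code class="language-plaintext highlighter-rouge">t.res.a</code>/<code class="language-plaintext highlighter-rouge">b</code>/<code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">p.res.a</code>/<code class="language-plaintext highlighter-rouge">b</code>/<code class="language-plaintext highlighter-rouge">c</code> objects from the chunks above (the column labels are illustrative):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tests  <- list(t.res.a, t.res.b, t.res.c)
powers <- list(p.res.a, p.res.b, p.res.c)
res <- rbind(
  "p-value"           = sapply(tests, function(t) t$p.value),
  "conf-interval-low" = sapply(tests, function(t) t$conf.int[1]),
  "conf-interval-up"  = sapply(tests, function(t) t$conf.int[2]),
  "power"             = sapply(powers, function(p) p$power)
)
colnames(res) <- c("dose.0.5v1", "dose.0.5v2", "dose.1v2")
res
</code></pre></div></div>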
<h2 id="conclusions">Conclusions</h2>
<p>We failed to reject the null-hypothesis regarding the impact of the
delivery method on tooth growth.</p>
<p>The dosage, however, was found to be statistically significant:
all three pairwise tests rejected the null hypothesis.</p>
<p>This assignment was part of the Johns Hopkins Coursera module on <a href="https://www.coursera.org/learn/reproducible-research">Reproducible Research</a> as part of the <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science Specialization</a>.</p>
<p><!--more--></p>
<p>Full code can be found on <a href="https://github.com/bmaingret/coursera-data-science-jhu/tree/master/05-reproducible-research/01-week2-assignement">GitHub</a>.</p>
<h2 id="loading-and-preprocessing-the-data">Loading and preprocessing the data</h2>
<p>The variables included in this dataset are:</p>
<ul>
<li>
<p><strong>steps</strong>: Number of steps taken in a 5-minute interval (missing
values are coded as <code class="language-plaintext highlighter-rouge">NA</code>)</p>
</li>
<li>
<p><strong>date</strong>: The date on which the measurement was taken in YYYY-MM-DD
format</p>
</li>
<li>
<p><strong>interval</strong>: Identifier for the 5-minute interval in which
measurement was taken</p>
</li>
</ul>
<p>The dataset is stored in a comma-separated-value (CSV) file and
contains a total of 17,568 observations.</p>
<p>Loading the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"activity.csv"</span><span class="p">)){</span><span class="w">
</span><span class="n">unzip</span><span class="p">(</span><span class="s2">"activity.zip"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"activity.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">na.strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"NA"</span><span class="p">,</span><span class="w"> </span><span class="n">colClasses</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"integer"</span><span class="p">,</span><span class="w"> </span><span class="s2">"character"</span><span class="p">,</span><span class="w"> </span><span class="s2">"integer"</span><span class="p">))</span><span class="w">
</span><span class="n">data</span><span class="o">$</span><span class="n">date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="o">=</span><span class="s2">"%Y-%m-%d"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Checking the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
</code></pre></div></div>
<h2 id="what-is-mean-total-number-of-steps-taken-per-day">What is the mean total number of steps taken per day?</h2>
<p><em>For this part of the assignment, you can ignore the missing values in the dataset.</em></p>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_steps</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">total_steps</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">total</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_histogram</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Number of steps per day"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
</code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-4-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">total_steps</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 9354.23
</code></pre></div></div>
<p>The mean total number of steps per day is: <strong>9354.23</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">median</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">total_steps</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">median</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 10395
</code></pre></div></div>
<p>The median total number of steps per day is: <strong>10395</strong></p>
<h2 id="what-is-the-average-daily-activity-pattern">What is the average daily activity pattern?</h2>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">steps_interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">steps_interval</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Average steps per interval"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"Average steps"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">"Interval"</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-8-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">steps_interval</span><span class="o">$</span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">steps_interval</span><span class="p">[[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">]]</span><span class="w">
</span><span class="n">val</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">steps_interval</span><span class="p">[[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="s2">"mean"</span><span class="p">]]</span><span class="w">
</span></code></pre></div></div>
<p>The interval with the highest mean number of steps is <strong>835</strong>, with a mean of <strong>206.17</strong> steps.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">floor</span><span class="p">(</span><span class="n">interval</span><span class="o">/</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">interval</span><span class="o">%%</span><span class="m">100</span><span class="w">
</span></code></pre></div></div>
<p>The interval identifier encodes the clock time as hour × 100 + minute (the values run 0, 5, …, 55, 100, 105, …, 2355), so interval <strong>835</strong> corresponds to <strong>08:35</strong>.</p>
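<p>This conversion can be wrapped in a small helper for reuse (a sketch; <code class="language-plaintext highlighter-rouge">interval_to_time</code> is a hypothetical name, not part of the original analysis):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Format an HHMM-encoded interval identifier as a clock-time string
interval_to_time <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}
interval_to_time(835)  # "08:35"
</code></pre></div></div>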
<h2 id="imputing-missing-values">Imputing missing values</h2>
<p>Total number of missing values per column:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">apply</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval
## 2304 0 0
</code></pre></div></div>
<p>We will fill in the missing <code class="language-plaintext highlighter-rouge">steps</code> values with the mean for the matching day of the week and interval.</p>
<p>First we compute the mean for each interval and day of the week:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">weekday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">wday</span><span class="p">))</span><span class="w">
</span><span class="n">fill_val</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">weekday</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p>Imputing missing data.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">row</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">data_nna</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"steps"</span><span class="p">]))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">wd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"weekday"</span><span class="p">]</span><span class="w">
</span><span class="n">interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"interval"</span><span class="p">]</span><span class="w">
</span><span class="n">data_nna</span><span class="p">[</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="s2">"steps"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fill_val</span><span class="p">[</span><span class="n">fill_val</span><span class="o">$</span><span class="n">weekday</span><span class="o">==</span><span class="n">wd</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">fill_val</span><span class="o">$</span><span class="n">interval</span><span class="o">==</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="s2">"mean"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data_nna</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## steps date interval weekday
## 0 0 0 0
</code></pre></div></div>
<p>Repeating the first steps of the assignment, now with the imputed data.
Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_steps_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">steps</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">total</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_histogram</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Number of steps per day"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
</code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-15-1.png" alt="" /><!-- --></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="o">$</span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="n">mean_nna</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 10821.21
</code></pre></div></div>
<p>The mean total number of steps per day is: <strong>10821.21</strong> (was 9354.23 before imputation).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">median_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">total_steps_nna</span><span class="o">$</span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="n">median_nna</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 11015
</code></pre></div></div>
<p>The median total number of steps per day is: <strong>11015</strong> (was 10395 before imputation).</p>
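<p>For a quick side-by-side view of how imputation shifted the summary statistics, the values can be collected into one small table (a sketch; <code class="language-plaintext highlighter-rouge">summarise_imputation</code> is a hypothetical helper name):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Collect mean and median of daily totals before and after imputation
summarise_imputation <- function(before, after) {
  data.frame(
    statistic = c("mean", "median"),
    original  = c(mean(before), median(before)),
    imputed   = c(mean(after), median(after))
  )
}
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">summarise_imputation(total_steps$total, total_steps_nna$total)</code> should reproduce the figures reported above in a single data frame.</p>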
<h2 id="are-there-differences-in-activity-patterns-between-weekdays-and-weekends">Are there differences in activity patterns between weekdays and weekends?</h2>
<p>Summarizing the data:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># POSIXlt $wday runs 0-6 starting on Sunday, so the weekend is c(0, 6)</span><span class="w">
</span><span class="n">steps_interval_nna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_nna</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">week.part</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">if_else</span><span class="p">(</span><span class="n">weekday</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="s2">"weekend"</span><span class="p">,</span><span class="w"> </span><span class="s2">"weekdays"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">week.part</span><span class="p">,</span><span class="n">interval</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">steps_interval_nna</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">interval</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="n">week.part</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">rows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vars</span><span class="p">(</span><span class="n">week.part</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Average steps per interval: weekdays vs weekends"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"Average steps"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s2">"Interval"</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2019-10-23-Reproducible-Research_Activity-data/unnamed-chunk-19-1.png" alt="" /><!-- --></p>
<p>This assignment was part of the Johns Hopkins Coursera module on Reproducible Research, itself part of the Data Science Specialization.</p>