7 Reasons why Code Comments are a Code Smell


1. Comments that just describe what is happening

Let's have a look at the following code example:

##################
##### Imports ####
##################
import os.path

## Define path to executable:
path = "/usr/bin/executable"

## Extract directory: 
d = os.path.dirname(path) 

## Print Directory of executable:
print(d)

These comments obviously just describe what happens in the code below them. However, this is not necessary, because the code is easy to read if we just name the variables better:

import os.path

path_to_executable = "/usr/bin/executable"
directory_of_executable = os.path.dirname(path_to_executable)
print(directory_of_executable)

2. Comments are a sign of overly complex logic

Often, we see code comments when the code deals with complex or nested logic:

## We have to check that the credit card of the user is not expired and that the user is solvent.
if user.state.value == "solvent" and user.credit_card.get_expiration_dt() > datetime.datetime.now().date():
    # If we process Visa cards, online payment service has to be enabled:
    if not (user.credit_card.type == "Visa" and "online_payment" in user.credit_card.payment_services):
        raise OnlinePaymentServiceNotEnabledException("Visa Payment not possible.") 
    process_payment(user.credit_card, payment_details)
else:
    raise PaymentException("...")

This is not easy to read, and there is also the danger that the comments will become outdated (see below) after some time. In this case, it is better to define small functions with proper names that describe what happens within the code:

if is_solvent(user) and credit_card_is_active(credit_card:=user.credit_card):
    if not (credit_card.type == "Visa" and online_payment_service_enabled(credit_card)):
        raise OnlinePaymentServiceNotEnabledException("Visa Payment not possible.")
    process_payment(credit_card, payment_details)
else:
    raise PaymentException("...")


def is_solvent(user: User) -> bool:
    return user.state.value == "solvent"

def credit_card_is_active(credit_card: CreditCard) -> bool:
    return credit_card.get_expiration_dt() > datetime.datetime.now().date()

def online_payment_service_enabled(credit_card: CreditCard) -> bool:
    return "online_payment" in credit_card.payment_services

3. Comments that describe the implementation

from math import sqrt


def calculate_distance(x: Vector, y: Vector) -> float:
    """The distance is calculated by:
            1. Calculate the difference between each components of the 2 vectors
            2. Square the differences
            3. Sum up 
            4. Take the squareroot
    """

    square_sum = 0
    for i in range(3):
        difference = x[i] - y[i]
        square = difference**2
        square_sum += square

    return sqrt(square_sum)

Why do we need this? The code describes perfectly what is happening, and there is no need to describe the implementation, also because the implementation can be changed in different ways (e.g. using a high-level library like numpy).
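The implementation can indeed be swapped without any comment becoming stale. For instance, since Python 3.8 the standard library ships a helper for exactly this; a minimal sketch:

```python
from math import dist


def calculate_distance(x, y) -> float:
    # Delegate the Euclidean distance calculation entirely to the
    # standard library; no implementation comment needed.
    return dist(x, y)


print(calculate_distance((0, 0, 0), (3, 4, 0)))  # 5.0
```

The step-by-step docstring from above would now describe an implementation that no longer exists.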

4. Comments can be outdated

When code is changed or refactored, the test suite should ensure that the business logic and functionality of the program are still valid. However, comments are like dead code and are likely to be left untouched by the programmer at hand. Thus, if the implementation changes, the comments are likely to no longer describe the logic. Even worse, these outdated ...

5. ... Comments can be misleading

Comments can be misleading, because they are not well written ("hey, this is just a comment. No need to be really careful writing these ...") or are outdated (as stated above).

Consider the following example:

## User is an adult:
if user_age >= 18:
    # User is active
    if user_is_active:
        process_payment()
    else:
        raise UserNotActiveException()
else:
    raise UserNotOldEnoughException()

This is pretty nested logic for such simple business logic. Thus, we should invert the conditions to simplify the processing. So let us refactor this a bit:

## User is an adult:
if user_age < 18:
    raise UserNotOldEnoughException()
## User is active
if not user_is_active:
    raise UserNotActiveException()
process_payment()

This code is obviously better, so the refactoring was a good idea. However, if you look closely, we forgot to adjust the code comments and now they are misleading.

6. Commented-out code

Since we all have a good version control system at hand today (#git), there is really no excuse for storing old or unfinished code in the production branch. Keep these dead code parts away from production, PERIOD.

7. Comments hide bad naming

Often, comments are used to describe a variable, because there was no effort to try to name the variable properly. Look at the following example:

import datetime


def check_token_validity(token_dt):
    # We check that the token timestamp is only 1 hour old
    return datetime.datetime.now().timestamp() - token_dt < 3600

There are several things that we can improve on this code snippet:

  1. The name token_dt implies that we have a datetime object at hand. However, if we look closely, we see that it is a UTC timestamp!
  2. Doing calculations inside comparisons is generally a bad idea, because it makes the expression too complex to understand directly.
  3. What is the 3600 doing there? What is the unit and why was it chosen? We should make it clearer what this number means.

Now, let us look at the refactored code:

import datetime

MAXIMUM_TOKEN_LIFETIME_IN_SECONDS = 60 * 60

def check_token_validity(token_timestamp: float) -> bool:
    current_timestamp = datetime.datetime.now().timestamp()
    token_lifetime = current_timestamp - token_timestamp
    return token_lifetime < MAXIMUM_TOKEN_LIFETIME_IN_SECONDS

With this refactoring we have addressed all three issues: the parameter name token_timestamp makes clear that we deal with a timestamp, the calculation is extracted into the well-named variable token_lifetime, and the magic number 3600 is replaced by the named constant MAXIMUM_TOKEN_LIFETIME_IN_SECONDS.

Good Code Comments

After we have seen that the most common patterns of code comments are actually not a great idea, let us look at the exceptions:

1. Code documentation

Code documentation for public APIs, like docstrings, can be a great idea, especially for libraries that are consumed by third parties. These docstrings can then be picked up by documentation tools like Sphinx or MkDocs for automated generation of nice API documentation.
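As an illustration (the function and its signature are invented for this example), a Google-style docstring on a public function that such tools could render:

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    Args:
        celsius: Temperature in degrees Celsius.

    Returns:
        The same temperature in degrees Fahrenheit.
    """
    return celsius * 9 / 5 + 32


print(celsius_to_fahrenheit(100.0))  # 212.0
```

Note that the docstring documents the contract (arguments and return value), not the implementation.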

Note that it is in general not necessary to put a code comment on private functions that are not part of the public API, because they should generally not be used by external users and typically have a higher rate of change than the public APIs. Therefore, they are prone to outdated and misleading docstrings/comments as described above. For these functions, you should instead invest in a good function name that properly describes what the function does without the need for additional commentary. If this is not possible, it is often a sign of a function that does two or more things and should therefore be split up and refactored to reduce its complexity.

2. References to stories or requirements

If a program is written as a feature of e.g. a story, it may be a good idea to reference the story at the high-level code interface. Thus, when another programmer wants to know why a functionality has been developed in the first place, they can go back to the story or additional user documentation.

3. Additional details that explain the reasoning for the implementation if the latter is not obvious

There are code implementations or bugfixes that may not be obvious to a reader without the corresponding context. For these, there should be a reference to the corresponding issue or a comment describing the background information necessary to understand the perhaps complex implementation. Think about something like this:

import time

run_id = start_run()

## We have to wait a few seconds for the server to respond after the submission of a run. If we query for the run status
## too early, we will get an error, because the internal database of the service has to catch up:
time.sleep(WAIT_TIME_IN_SECONDS)

while True:
    run_finished = poke_for_run(run_id)
    if run_finished:
        break

A programmer without detailed knowledge of the external system will need additional information about the intent of the chosen solution to understand the reasoning.

4. TODO Comments

Sometimes, when working on code we find a part of the code that is not well written or might have a bad side effect in the future, but we do not have the time to directly work on the problem. Or we see that we can improve an implementation and make it way more efficient. Leaving a # TODO comment can be a great way of finding those pieces in your code to later work on them.
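A minimal sketch of such a marker (the function and the threshold are hypothetical):

```python
# TODO: Replace the linear scan with a set lookup once the list of
#       blocked users grows beyond a few hundred entries.
def is_blocked(user_id: int, blocked_user_ids: list[int]) -> bool:
    return user_id in blocked_user_ids


print(is_blocked(3, [1, 2, 3]))  # True
```

Many IDEs and linters can list all `# TODO` markers in a project, which makes these notes easy to find again.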

5. Warnings

Sometimes, there is a strange side effect or just something that a user has to know before calling the function at hand. Here, a warning comment may be appropriate.
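For example, a destructive helper where the caller really should read the comment first (the function is invented for illustration):

```python
import os


def clear_cache_directory(cache_dir: str) -> None:
    # WARNING: This deletes every file in cache_dir without any confirmation.
    # Only pass directories that contain nothing but regenerable cache files!
    for name in os.listdir(cache_dir):
        os.remove(os.path.join(cache_dir, name))
```

Here the comment carries information that the signature alone cannot express: the irreversible side effect.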

Marp: Markdown Presentation Ecosystem

What is Marp?

> Marp (also known as the Markdown Presentation Ecosystem) provides an intuitive experience for creating beautiful slide decks. You only have to focus on writing your story in a Markdown document.

Here you find a few good references if you wish to get started using Marp:

How to build a cool presentation

Marp is a really powerful tool to build an HTML presentation using Markdown. It can be developed inside VS Code (see e.g. https://www.youtube.com/watch?v=EzQ-p41wNEE) and the HTML can be directly exported there.

Let me give you an example presentation that shows how easy it is:

Marp Example Presentation

Example:

---
paginate: true
marp: true
theme: uncover
class: invert
style: |
  .small-red-text {
    font-size: 0.75rem;
    color: red;
  }

  h1 {
    font-size: 60px;
  }

  h2 {
    font-size: 50px;
  } 

  h3 {
    font-size: 40px;
  }

  h4 {
    font-size: 30px;
  }

  h5,h6,p,li,code,table {
    font-size: 25px;
  }
headingDivider: 1
math: mathjax
---


## **Marp**


![bg left:40% 80%](https://marp.app/assets/marp.svg)

Markdown Presentation Ecosystem

https://marp.app/


## How to write slides

Split pages by horizontal ruler (`---`). It's very simple! :satisfied:

```markdown
## Slide 1

foobar


## Slide 2

foobar
```

Alternatively, set the **directive** [headingDivider](https://marpit.marp.app/directives?id=heading-divider). To automatically split at `h1` headers add to the document metadata:
```md
headingDivider: 1
```


## Markdown !!!

*Just* **write** `Markdown` ==here==!!!

- Bullet 1
- Bullet 2

#### Subtitle

[Marp Link](https://marpit.marp.app/)

**Emojies**: :joy: :wink: :smile: :rocket:


## Markdown !!!
##### Use automatic syntax highlighting
```python
def foo(bar: str) -> str:
    return bar.strip()

if __name__ == "__main__":
    main()

```


## Markdown !!!

#### Tables

| Syntax      | Description | Test Text     |
| :---        |    :----:   |          ---: |
| Header      | Title       | Here's this   |
| Paragraph   | Text        | And more      |


## Markdown !!!

### Crazy Math
<!-- _footer: You have to add **math: mathjax** to the document metadata -->
$$ 
f(a) = \frac{1}{2 \pi i} \oint_\gamma \frac{f(z)}{z-a} dz
$$


## Custom Backgrounds

![bg](https://patrikhlobil.github.io/Blog/blogs/images/background.png)

Use **Image Directives** (local or web-resource):

```md
![bg](https://patrikhlobil.github.io/Blog/blogs/images/background.png)
```


## Custom Backgrounds

![bg](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)

![bg](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
**Multiple Backgrounds!!!**

```md
![bg](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)

![bg](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
```


## Images

Put Images to the left

![bg left](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```md
![bg left](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```

## Images

... or to the right

![bg right](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```md
![bg right](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```

## Images

... or adjust the width ...

![bg left:25%](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```md
![bg left:25%](https://patrikhlobil.github.io/Blog/blogs/images/cats.jpg)
```


## Images

... or apply filters:

![bg left sepia](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
![bg right blur](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
```md
![bg left sepia](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
![bg right blur](https://patrikhlobil.github.io/Blog/blogs/images/sea.jpg)
```


## GIFS !!!
```
![bg right](https://media0.giphy.com/media/SbtWGvMSmJIaV8faS8/200w.webp?
cid=ecf05e47i5eyozrdt5aytc41m0o7ea6uxu6ck2088uxo4ojz&ep=v1_gifs_search&rid=200w.webp&ct=g)
```
![bg](https://media0.giphy.com/media/SbtWGvMSmJIaV8faS8/200w.webp?cid=ecf05e47i5eyozrdt5aytc41m0o7ea6uxu6ck2088uxo4ojz&ep=v1_gifs_search&rid=200w.webp&ct=g)


## Mermaid Support

<!-- Add this anywhere in your Markdown file -->
<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true });
</script>

<div class="mermaid">
journey
    title My working day
    section Go to work
      Make tea: 5: Me
      Go upstairs: 3: Me
      Do work: 1: Me, Cat
    section Go home
      Go downstairs: 5: Me
      Sit down: 3: Me
</div>

## HTML Support

Add custom CSS to document metadata
```
---
marp: true
style: |
  .small-red-text {
    font-size: 0.75rem;
    color: red;
  }
---
## Title
<div class="small-red-text"> Small Red Text</div>
```


<div class="small-red-text"> Small Red Text</div>

## HTML Support

<div id="myDiv"> </div>
<script src='https://cdn.plot.ly/plotly-2.24.1.min.js'></script>
<script src='https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.17/d3.min.js'></script>

<script>
d3.json('https://raw.githubusercontent.com/plotly/plotly.js/master/test/image/mocks/sankey_energy.json', function(fig){

var data = {
  type: "sankey",
  domain: {
    x: [0,1],
    y: [0,1]
  },
  orientation: "h",
  valueformat: ".0f",
  valuesuffix: "TWh",
  node: {
    pad: 15,
    thickness: 15,
    line: {
      color: "black",
      width: 0.5
    },
   label: fig.data[0].node.label,
   color: fig.data[0].node.color
      },

  link: {
    source: fig.data[0].link.source,
    target: fig.data[0].link.target,
    value: fig.data[0].link.value,
    label: fig.data[0].link.label
  }
}

var data = [data]

var layout = {
  title: "Energy forecast for 2050<br>Source: Department of Energy & Climate Change, Tom Counsell via <a href='https://bost.ocks.org/mike/sankey/'>Mike Bostock</a>",
  width: 1000,
  height: 500,
  font: {
    size: 10
  }
}

Plotly.newPlot('myDiv', data, layout)
});

</script>


## Fragmented lists

Use `*` to get a fragmented list, where the items appear one after another, and `-` to directly display all items.

#### Non-Fragmented

- non-frag 1
- non-frag 2

#### Fragmented

* frag 1
* frag 2


## Speaker Notes

Just add **Speaker Notes** via:
```html
<!-- 
Some notes here that might be useful.
-->
```


<!-- 
Some notes here that might be useful.
-->

## Directives

<!-- _backgroundColor: #180f61 -->

With `Directives`, it is easy to modify the behaviour of **Marp** for a single slide or the whole presentation.

##### Local Directives

Using local directives (prepend with `_`), one can modify a single page, e.g.
```html
<!-- _backgroundColor: #180f61 -->
```

##### Global Directives

Using global directives, one can modify the behaviour of the slides from the current slide on
```html
<!-- backgroundColor: aqua -->
```

Async Python

Parallelism in Python

There exist three distinct ways to parallelize code in Python, namely threading, multiprocessing, and async. All three methods are based upon a different idea. However, the first two are more indirect ways to get parallelism, where the operating system's scheduler is involved, whereas the latter introduces a completely new paradigm for programming in Python (async/await).

In this article, we will quickly introduce each of the concepts, apply them to two code examples and perform benchmarks to compare them. Let us first introduce the two scripts that we are going to parallelize.

The first example sleep.py is a simple script that executes a function do_work 5 times. The function itself just sleeps for 2 seconds, mimicking waiting for an external resource like a database or an HTTP response:

sleep.py
import time
import datetime


def do_work(number: int):
    print(f"{datetime.datetime.now() - start_time}: Start work for {number=}")
    time.sleep(2)
    print(f"{datetime.datetime.now() - start_time}: Finished work for {number=}")


start_time = datetime.datetime.now()
for i in range(5):
    do_work(i)
print(f"Finished program after {datetime.datetime.now() - start_time}")

The second example hard_work.py has the same structure; however, the function do_real_work performs actual work on the CPU and calculates a sum using a for loop.

hard_work.py
import datetime


def do_real_work(number: int):
    print(f"{datetime.datetime.now()-start_time}: Start work for {number=}")
    output = 0
    for i in range(100_000_000):
        output += 1
    print(f"{datetime.datetime.now()-start_time}: Finished work for {number=}")

start_time = datetime.datetime.now()
for i in range(5):
    do_real_work(i)
print(f"Finished program after {datetime.datetime.now() - start_time}")

Synchronous execution

When we execute the 2 scripts above, we get the following output

>>> python3 sleep.py

0:00:00.000010: Start work for number=0
0:00:02.005169: Finished work for number=0
0:00:02.005421: Start work for number=1
0:00:04.008039: Finished work for number=1
0:00:04.008167: Start work for number=2
0:00:06.010063: Finished work for number=2
0:00:06.010195: Start work for number=3
0:00:08.011329: Finished work for number=3
0:00:08.011481: Start work for number=4
0:00:10.016531: Finished work for number=4
Finished program after 0:00:10.016634

and:

>>> python3 hard_work.py

0:00:00.000005: Start work for number=0
0:00:02.352451: Finished work for number=0
0:00:02.352548: Start work for number=1
0:00:04.687044: Finished work for number=1
0:00:04.687078: Start work for number=2
0:00:07.084300: Finished work for number=2
0:00:07.084339: Start work for number=3
0:00:09.425129: Finished work for number=3
0:00:09.425161: Start work for number=4
0:00:11.770652: Finished work for number=4
Finished program after 0:00:11.770680

Thus we can see the following execution times:

| Script       | Synchronous Execution |
| :---         | ---:                  |
| sleep.py     | 10 s                  |
| hard_work.py | 11.8 s                |

Threading

Using the threading module in Python, one can easily execute tasks in parallel. We then get for the first example:

sleep.py
import time
import datetime
from threading import Thread


def do_work(number: int):
    print(f"{datetime.datetime.now() - start_time}: Start work for {number=}")
    time.sleep(2)
    print(f"{datetime.datetime.now() - start_time}: Finished work for {number=}")


start_time = datetime.datetime.now()

tasks = [Thread(target=do_work, args=(i,)) for i in range(5)]
for task in tasks:
    task.start()
for task in tasks:
    task.join()

print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 sleep.py

0:00:00.000071: Start work for number=0
0:00:00.000126: Start work for number=1
0:00:00.000168: Start work for number=2
0:00:00.000205: Start work for number=3
0:00:00.000246: Start work for number=4
0:00:02.005241: Finished work for number=0
0:00:02.005388: Finished work for number=4
0:00:02.005453: Finished work for number=1
0:00:02.005475: Finished work for number=2
0:00:02.005430: Finished work for number=3
Finished program after 0:00:02.006002

As can be seen, the program finished in only 2 seconds instead of 10, because all tasks were processed in parallel, each with a runtime of 2 seconds.

However, when we do the same for the other example:

hard_work.py
import datetime
from threading import Thread

def do_real_work(number: int):
    print(f"{datetime.datetime.now()-start_time}: Start work for {number=}")
    output = 0
    for i in range(100_000_000):
        output += 1
    print(f"{datetime.datetime.now()-start_time}: Finished work for {number=}")

start_time = datetime.datetime.now()

tasks = [Thread(target=do_real_work, args=(i,)) for i in range(5)]
for task in tasks:
    task.start()
for task in tasks:
    task.join()

print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 hard_work.py

0:00:00.000125: Start work for number=0
0:00:00.000200: Start work for number=1
0:00:00.019017: Start work for number=2
0:00:00.049231: Start work for number=3
0:00:00.080564: Start work for number=4
0:00:10.813826: Finished work for number=2
0:00:10.885100: Finished work for number=3
0:00:10.988489: Finished work for number=0
0:00:11.014248: Finished work for number=1
0:00:11.020219: Finished work for number=4
Finished program after 0:00:11.020369

we still need 11 seconds, like in the synchronous execution. The reason for this lies in the way threading is handled by the CPython interpreter. Due to the Global Interpreter Lock (GIL), Python ensures that at any moment only one thread can be actively executing Python code. Therefore, we are executing multiple functions in parallel, but we are effectively still using only one core at a time. This can be seen in the image below.

Threading and the GIL

Execution of Python code with threads (the GIL prevents true parallelism)

In the sleep example above, the function was basically doing nothing except waiting (mimicking for example a wait time due to a database or HTTP request). Therefore, the parallel execution works fine, because there is no real work to be done.

For the hard_work.py code snippet, the execution time is really spent executing Python code. As can be seen in the output, all 5 tasks start at nearly the same time, but since each gets only 1/5 of the CPU, each task needs about 5 times as long. As a result, there is no speed improvement when using threading in Python for computationally expensive applications.

The observed execution times so far:

| Script       | Synchronous Execution | Threading |
| :---         | ---:                  | ---:      |
| sleep.py     | 10 s                  | 2 s       |
| hard_work.py | 11.8 s                | 11 s      |

Multiprocessing

Real multiprocessing can be used in Python very similarly to threads. Note that in this case, the processes are encapsulated in separate child processes and cannot (easily) access the data of other processes. Let us again modify our two code examples:

sleep.py
import time
import datetime
from multiprocessing import Process


def do_work(number: int):
    print(f"{datetime.datetime.now() - start_time}: Start work for {number=}")
    time.sleep(2)
    print(f"{datetime.datetime.now() - start_time}: Finished work for {number=}")


start_time = datetime.datetime.now()

tasks = [Process(target=do_work, args=(i,)) for i in range(5)]
for task in tasks:
    task.start()
for task in tasks:
    task.join()

print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 sleep.py

0:00:00.006796: Start work for number=0
0:00:00.007449: Start work for number=1
0:00:00.008078: Start work for number=2
0:00:00.008823: Start work for number=3
0:00:00.009255: Start work for number=4
0:00:02.008293: Finished work for number=0
0:00:02.008291: Finished work for number=1
0:00:02.009784: Finished work for number=3
0:00:02.009597: Finished work for number=2
0:00:02.010936: Finished work for number=4
Finished program after 0:00:02.013174

hard_work.py
import datetime
from multiprocessing import Process

def do_real_work(number: int):
    print(f"{datetime.datetime.now()-start_time}: Start work for {number=}")
    output = 0
    for i in range(100_000_000):
        output += 1
    print(f"{datetime.datetime.now()-start_time}: Finished work for {number=}")

start_time = datetime.datetime.now()

tasks = [Process(target=do_real_work, args=(i,)) for i in range(5)]
for task in tasks:
    task.start()
for task in tasks:
    task.join()

print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 hard_work.py

0:00:00.007729: Start work for number=0
0:00:00.008396: Start work for number=1
0:00:00.008928: Start work for number=2
0:00:00.011777: Start work for number=3
0:00:00.012632: Start work for number=4
0:00:02.649805: Finished work for number=2
0:00:02.655102: Finished work for number=1
0:00:02.688040: Finished work for number=3
0:00:02.738391: Finished work for number=4
0:00:02.888424: Finished work for number=0
Finished program after 0:00:02.889433

Now, we finally see Python really use several cores of our CPU to execute the tasks in parallel. This leads us to the following execution times:

| Script       | Synchronous Execution | Threading | Multiprocessing |
| :---         | ---:                  | ---:      | ---:            |
| sleep.py     | 10 s                  | 2 s       | 2 s             |
| hard_work.py | 11.8 s                | 11 s      | 2.8 s           |

Asynchronous Python

The last paradigm to execute Python code in parallel is the use of async/await. This method somewhat resembles the threading idea. However, in this case it is not the OS scheduler that assigns resources to the tasks/threads, but the async event loop scheduler of Python. There are several key advantages in comparison to regular threading:

  1. Reduced overhead, since no additional threads have to be scheduled by the OS
  2. Many more tasks can be scheduled concurrently
  3. async/await allows explicit control over where tasks can be paused and resumed

Let us now have a look at the sleep example:

sleep.py
import datetime
import asyncio


async def do_work(number: int):
    print(f"{datetime.datetime.now() - start_time}: Start work for {number=}")
    await asyncio.sleep(2)
    print(f"{datetime.datetime.now() - start_time}: Finished work for {number=}")


async def main():
    tasks = [asyncio.create_task(do_work(i)) for i in range(5)]
    for task in tasks:
        await task


if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 sleep.py

0:00:00.000177: Start work for number=0
0:00:00.000206: Start work for number=1
0:00:00.000214: Start work for number=2
0:00:00.000219: Start work for number=3
0:00:00.000224: Start work for number=4
0:00:02.001531: Finished work for number=0
0:00:02.001620: Finished work for number=1
0:00:02.001637: Finished work for number=2
0:00:02.001648: Finished work for number=3
0:00:02.001660: Finished work for number=4
Finished program after 0:00:02.002334

Similar to the threading and multiprocessing cases, we could speed up the program by about a factor of 5 here. The key difference is the way we lay out our program:

  1. Using the await keyword, we can tell the event loop to pause the execution of the function and work on another task. The event loop will then execute another task until the next await is reached, where the scheduler hands control to yet another task. In the code above, every await marks a point where the event loop may temporarily suspend the active task. In the real world, these awaited calls often go to systems where we have to wait for a response, like databases, IO or HTTP calls.
  2. With asyncio.run(main()), we start the async event loop that is responsible for scheduling the resources between the tasks, and tell it to execute our async `main` function.
  3. Tasks can be scheduled using asyncio.create_task and are then executed in parallel. Calling await on a task waits for the task to finish (similar to Thread.join). If the async function has a return value, it can just be assigned via value = await task.
  4. Only async functions can be awaited. Note that you can just call regular synchronous code (like print) inside async functions; however, it is not possible to await async functions from synchronous code!
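To illustrate the return-value point, a small self-contained sketch where each task returns a result that we collect via await (the doubling function is made up):

```python
import asyncio


async def fetch_value(number: int) -> int:
    # Pretend to wait for an external system, then return a result.
    await asyncio.sleep(0.01)
    return number * 2


async def main() -> list[int]:
    # All three tasks are scheduled immediately and run concurrently.
    tasks = [asyncio.create_task(fetch_value(i)) for i in range(3)]
    # Awaiting each task yields its return value.
    return [await task for task in tasks]


results = asyncio.run(main())
print(results)  # [0, 2, 4]
```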

Asynchronous Python code allows for concurrent execution without the use of threads or processes; however, there is still only one piece of Python code executing at any time. Therefore, the hard work code:

hard_work.py
import asyncio
import datetime


async def do_real_work(number: int):
    print(f"{datetime.datetime.now()-start_time}: Start work for {number=}")
    output = 0
    for i in range(100_000_000):
        output += 1
    print(f"{datetime.datetime.now()-start_time}: Finished work for {number=}")


async def main():
    tasks = [asyncio.create_task(do_real_work(i)) for i in range(5)]
    for task in tasks:
        await task


if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 hard_work.py

0:00:00.000176: Start work for number=0
0:00:02.103602: Finished work for number=0
0:00:02.103778: Start work for number=1
0:00:04.209974: Finished work for number=1
0:00:04.210021: Start work for number=2
0:00:06.322592: Finished work for number=2
0:00:06.322637: Start work for number=3
0:00:08.446054: Finished work for number=3
0:00:08.446091: Start work for number=4
0:00:10.562800: Finished work for number=4
Finished program after 0:00:10.563345

does not show a big improvement like the multiprocessing example did, leading to the final execution time overview:

| Script       | Synchronous Execution | Threading | Multiprocessing | Async/Await |
| :---         | ---:                  | ---:      | ---:            | ---:        |
| sleep.py     | 10 s                  | 2 s       | 2 s             | 2 s         |
| hard_work.py | 11.8 s                | 11.02 s   | 2.8 s           | 10.6 s      |

Deeper dive into Async/Await

Parallel HTTP Requests

The real strength of asynchronous code can be seen if you are accessing external systems, where you have to wait for the response and the event loop can use the waiting time to do other things. Let us consider the following example of calling the PokeAPI to list details about each Pokemon. We are using the awesome httpx library, which is an async-capable drop-in replacement for requests:

unparallel_async.py
import asyncio
import datetime

import httpx

async def main():
    async with httpx.AsyncClient(base_url="") as client:
        all_pokemon_response = await client.get("https://pokeapi.co/api/v2/pokemon", params={"limit": 10000})
        all_pokemon_response.raise_for_status()
        all_pokemon = all_pokemon_response.json()["results"]

        for pokemon in all_pokemon:
            print(f"Get information for `{pokemon['name']}`")
            pokemon_details_response = await client.get(pokemon["url"])
            pokemon_details_response.raise_for_status()
            pokemon_details = pokemon_details_response.json()
            print(f'ID: {pokemon_details["id"]}, Name: {pokemon_details["name"]}, Height: {pokemon_details["height"]}, Weight: {pokemon_details["weight"]}')

if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 unparallel_async.py

Get information for `bulbasaur`
ID: 1, Name: bulbasaur, Height: 7, Weight: 69
Get information for `ivysaur`
ID: 2, Name: ivysaur, Height: 10, Weight: 130
Get information for `venusaur`
ID: 3, Name: venusaur, Height: 20, Weight: 1000

...
Get information for `miraidon-glide-mode`
ID: 10271, Name: miraidon-glide-mode, Height: 28, Weight: 2400
Finished program after 0:03:15.133072

Downloading the details of all Pokemon took about 3 minutes. This program seems to be asynchronous, but if you look closely, we never call asyncio.create_task or asyncio.gather to schedule async tasks in parallel. If you inspect the output, you will see that the program just iterates through all Pokemon URLs and each time waits until the single request is finished.

With a little change, the program can be refactored to be fully asynchronous:

parallel_async.py
import asyncio
import datetime

import httpx


async def print_pokemon_details(client: httpx.AsyncClient, pokemon: dict[str, str]):
    print(f"Get information for `{pokemon['name']}`")
    pokemon_details_response = await client.get(pokemon["url"])
    pokemon_details_response.raise_for_status()
    pokemon_details = pokemon_details_response.json()
    print(
        f'ID: {pokemon_details["id"]}, Name: {pokemon_details["name"]}, Height: {pokemon_details["height"]}, Weight: {pokemon_details["weight"]}'
    )


async def main():
    async with httpx.AsyncClient(base_url="") as client:
        all_pokemon_response = await client.get(
            "https://pokeapi.co/api/v2/pokemon", params={"limit": 10000}
        )
        all_pokemon_response.raise_for_status()
        all_pokemon = all_pokemon_response.json()["results"]

        await asyncio.gather(
            *[print_pokemon_details(client, pokemon) for pokemon in all_pokemon]
        )


if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 parallel_async.py

Get information for `bulbasaur`
Get information for `ivysaur`
Get information for `venusaur`
Get information for `charmander`
Get information for `charmeleon`
...
Get information for `miraidon-aquatic-mode`
Get information for `miraidon-glide-mode`
ID: 1, Name: bulbasaur, Height: 7, Weight: 69
ID: 4, Name: charmander, Height: 6, Weight: 85
ID: 3, Name: venusaur, Height: 20, Weight: 1000
...
ID: 10187, Name: morpeko-hangry, Height: 3, Weight: 30
ID: 10226, Name: urshifu-single-strike-gmax, Height: 290, Weight: 10000
ID: 10131, Name: minior-yellow-meteor, Height: 3, Weight: 400
Finished program after 0:00:04.860879

This async version is about 40 times faster than the previous one (roughly 3 minutes down to under 5 seconds)! As can be seen, all ~1000 coroutines are started first, and only afterwards do the requests finish and print their results. The execution flow is shown below.

Async Execution Flow for Pokemon API Example
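Besides asyncio.gather, the same concurrency can be achieved with asyncio.create_task, which schedules a coroutine on the event loop immediately; a minimal sketch (names hypothetical):

```python
import asyncio


async def work(i: int) -> int:
    await asyncio.sleep(0.05)  # stand-in for an awaited HTTP request
    return i * i


async def main() -> list[int]:
    # create_task schedules each coroutine immediately, so all four
    # sleeps overlap even though we await the tasks one by one afterwards
    tasks = [asyncio.create_task(work(i)) for i in range(4)]
    return [await task for task in tasks]


print(asyncio.run(main()))  # [0, 1, 4, 9]
```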

Building an async pipeline

So far, we used asyncio.gather or asyncio.create_task plus awaiting the tasks to run coroutines concurrently. However, for this to work, we have to know in advance which tasks we want to execute. Let us now consider an example where a slow service serves us the URLs for requesting the Pokemon details:

pokemon_pipeline.py
import asyncio
import datetime
from typing import AsyncIterator

import httpx


async def get_pokemons(client: httpx.AsyncClient) -> AsyncIterator[dict]:
    all_pokemon_response = await client.get(
        "https://pokeapi.co/api/v2/pokemon", params={"limit": 10000}
    )
    all_pokemon_response.raise_for_status()
    for pokemon in all_pokemon_response.json()["results"]:
        print(f"Get Url for Pokemon {pokemon['name']}")
        # Simulate a slow producer by sleeping:
        await asyncio.sleep(0.01)
        yield pokemon


async def print_pokemon_details(client: httpx.AsyncClient, pokemon: dict[str, str]):
    print(f"Get information for `{pokemon['name']}`")
    pokemon_details_response = await client.get(pokemon["url"])
    pokemon_details_response.raise_for_status()
    pokemon_details = pokemon_details_response.json()
    print(
        f'ID: {pokemon_details["id"]}, Name: {pokemon_details["name"]}, Height: {pokemon_details["height"]}, Weight: {pokemon_details["weight"]}'
    )


async def main():
    async with httpx.AsyncClient(base_url="") as client:
        pokemons = get_pokemons(client)
        await asyncio.gather(
            *[print_pokemon_details(client, pokemon) async for pokemon in pokemons]
        )


if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 pokemon_pipeline.py

Get Url for Pokemon bulbasaur
Get Url for Pokemon ivysaur
Get Url for Pokemon venusaur
...
Get information for `bulbasaur`
Get information for `ivysaur`
Get information for `venusaur`
...
ID: 10084, Name: pikachu-libre, Height: 4, Weight: 60
ID: 879, Name: copperajah, Height: 30, Weight: 6500
Finished program after 0:00:18.303518

As can be seen in the output, we first fetch all Pokemon URLs and only afterwards start the asynchronous HTTP calls for the Pokemon details. This is not what we want. However, there is a way to start downloading the details as soon as the first URLs arrive, by using a queue mechanism:

pokemon_pipeline.py
import asyncio
import datetime

import httpx


async def pokemons_producer(
        client: httpx.AsyncClient, pokemons: asyncio.Queue
) -> None:
    all_pokemon_response = await client.get(
        "https://pokeapi.co/api/v2/pokemon", params={"limit": 10000}
    )
    all_pokemon_response.raise_for_status()
    for pokemon in all_pokemon_response.json()["results"]:
        print(f"Get Url for Pokemon {pokemon['name']}")
        # Simulate a slow producer by sleeping:
        await asyncio.sleep(0.01)
        await pokemons.put(pokemon)
    await pokemons.put(None)


async def print_pokemon_details(client: httpx.AsyncClient, pokemon: dict):
    print(f"Get information for `{pokemon['name']}`")
    pokemon_details_response = await client.get(pokemon["url"])
    pokemon_details_response.raise_for_status()
    pokemon_details = pokemon_details_response.json()
    print(
        f'ID: {pokemon_details["id"]}, Name: {pokemon_details["name"]}, Height: {pokemon_details["height"]}, Weight: {pokemon_details["weight"]}'
    )


async def pokemon_details_consumer(client: httpx.AsyncClient, pokemons: asyncio.Queue):
    consumer_active = True
    while consumer_active:
        await asyncio.sleep(0.05)
        pokemons_to_process = []
        while not pokemons.empty():
            if (pokemon := await pokemons.get()) is not None:
                pokemons_to_process.append(pokemon)
            else:
                consumer_active = False
        await asyncio.gather(
            *(print_pokemon_details(client, pokemon) for pokemon in pokemons_to_process)
        )


async def main():
    async with httpx.AsyncClient(base_url="") as client:
        pokemons = asyncio.Queue()
        pokemon_producer_task = asyncio.create_task(pokemons_producer(client, pokemons))
        pokemon_details_consumer_task = asyncio.create_task(
            pokemon_details_consumer(client, pokemons)
        )
        await asyncio.gather(pokemon_producer_task, pokemon_details_consumer_task)


if __name__ == "__main__":
    start_time = datetime.datetime.now()
    asyncio.run(main())
    print(f"Finished program after {datetime.datetime.now() - start_time}")
>>> python3 pokemon_pipeline.py

Get Url for Pokemon bulbasaur
Get Url for Pokemon ivysaur
Get Url for Pokemon venusaur
Get Url for Pokemon charmander
Get information for `bulbasaur`
Get information for `ivysaur`
Get information for `venusaur`
Get information for `charmander`
Get Url for Pokemon charmeleon
Get Url for Pokemon charizard
ID: 1, Name: bulbasaur, Height: 7, Weight: 69
Get Url for Pokemon squirtle
Get Url for Pokemon wartortle
Get Url for Pokemon blastoise
ID: 2, Name: ivysaur, Height: 10, Weight: 130
ID: 4, Name: charmander, Height: 6, Weight: 85
...
Finished program after 0:00:15.070663

Here, we have a producer task (generating the Pokemon detail URLs) and a consumer task (fetching the details for each URL). They communicate via an asyncio.Queue, so both can operate concurrently and we can start fetching the Pokemon details as soon as the producer task has published the first URL. Using the same idea, it is of course possible to build a cascade of tasks, where each intermediate task consumes objects from one queue and produces objects into the next.
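Such a cascade can be sketched with a small stdlib-only pipeline of three stages, each consuming from one queue and producing into the next, with None as the end-of-stream sentinel (all names hypothetical):

```python
import asyncio


async def produce(out_q: asyncio.Queue) -> None:
    for i in range(5):
        await asyncio.sleep(0.01)  # simulate a slow source
        await out_q.put(i)
    await out_q.put(None)          # sentinel: no more items


async def transform(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    # Intermediate stage: consume from one queue, produce into the next
    while (item := await in_q.get()) is not None:
        await out_q.put(item * item)
    await out_q.put(None)          # forward the sentinel downstream


async def consume(in_q: asyncio.Queue, results: list) -> None:
    while (item := await in_q.get()) is not None:
        results.append(item)


async def main() -> list[int]:
    q1, q2, results = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently, connected by the queues
    await asyncio.gather(produce(q1), transform(q1, q2), consume(q2, results))
    return results


print(asyncio.run(main()))  # [0, 1, 4, 9, 16]
```

Each intermediate stage follows the same pattern, so the pipeline can be extended to as many stages as needed.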

Caution with Multiprocessing and fsspec Filesystems in Python

Multiprocessing can be a great way of scaling Python applications beyond one CPU core, especially given Python's slow speed compared to compiled languages like C or Java. However, one has to be very careful when passing Python class instances to child processes, especially if you are using an fsspec filesystem implementation.

Note: At the time of writing, the newest fsspec release was 2023.3.0.

The Problem

Filesystem Spec (fsspec) is a Python package that provides a unified interface to many filesystems. However, one has to be very careful when running tasks on an fsspec filesystem in parallel. Let us consider the following code:

import multiprocessing
from fsspec import AbstractFileSystem

multiprocessing.set_start_method("fork")


class MyClass:
    def __init__(self):
        self.store: dict[str, str] = {}


class MyFS(AbstractFileSystem):
    def __init__(self):
        self.store: dict[str, str] = {}


def return_store(my_class: MyClass) -> str:
    return f"Class Name: {my_class.__class__.__name__: <10} Store: {my_class.store}"


## Create class instance in parent process and add stuff to the internal store:
my_class = MyClass()
my_class.store["added"] = "content"

my_fs = MyFS()
my_fs.store["added"] = "content"


## Run the function in a child process:
with multiprocessing.Pool(processes=2) as pool:
    messages = pool.map(
        return_store,
        [my_class, my_fs],
    )


## Print out store content from child processes:
for message in messages:
    print(message)

What we have here are two very simple classes, MyClass and MyFS, that just create an empty store dictionary attribute when instantiated. The key difference between the two is that MyFS inherits from fsspec.AbstractFileSystem. We then create an instance of each class and add an entry to its store dictionary. Finally, we run a simple function in two child processes that returns the content of the store attribute as a string. The resulting output of this script will be:

Class Name: MyClass    Store: {'added': 'content'}
Class Name: MyFS       Store: {}

As can be seen, the store attribute of the MyFS instance does not contain the added entry, whereas the same class without the AbstractFileSystem inheritance does. The reason for these two different behaviours lies in the caching implementation of fsspec. The inheritance diagram looks like:

---
title: Class Diagram MyFS
---
classDiagram
    `fsspec.spec._Cached` <|-- `fsspec.AbstractFileSystem`
    `fsspec.AbstractFileSystem` <|-- `MyFS`

    class `fsspec.spec._Cached`{
        __init__()
        __call__()
    }
    class `fsspec.AbstractFileSystem`{
        + ...
    }

This line in the _Cached metaclass is responsible for clearing the instance cache, and with it all previously set attributes, when the instance is recreated in a forked child process:

class _Cached(type):
    ...
    def __call__(cls, *args, **kwargs):
        ...
        if os.getpid() != cls._pid:
            # In a child process, the pid differs, so the instance cache is
            # cleared and all previously set attributes are lost:
            cls._cache.clear()
        ...
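The effect of this cache can be illustrated with a simplified stand-in metaclass. This is a sketch of the mechanism only, not fsspec's actual implementation (which, for example, also keys the cache on the constructor arguments):

```python
import os


class _CachedSketch(type):
    # Simplified stand-in for fsspec.spec._Cached
    def __init__(cls, *args, **kwargs):
        super().__init__(*args, **kwargs)
        cls._cache = {}
        cls._pid = os.getpid()

    def __call__(cls, *args, **kwargs):
        if os.getpid() != cls._pid:
            # Running in a (simulated) child process: drop all cached
            # instances together with their state
            cls._cache.clear()
            cls._pid = os.getpid()
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cls._cache:
            cls._cache[key] = super().__call__(*args, **kwargs)
        return cls._cache[key]


class MyFS(metaclass=_CachedSketch):
    def __init__(self):
        self.store: dict[str, str] = {}


fs = MyFS()
fs.store["added"] = "content"
assert MyFS() is fs  # same pid: the cached, stateful instance is returned

MyFS._pid = -1       # simulate os.getpid() changing after a fork
print(MyFS().store)  # {} -- the cache was cleared, the added entry is gone
```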

Possible solutions

If you really want to parallelize applications that use an fsspec filesystem, there are at least two solutions.

Use threading

If you run the code above with a multiprocessing.pool.ThreadPool instead of a multiprocessing.Pool, it runs just fine and does not show the caching behaviour observed above.
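The reason is that threads share the parent's memory, so the worker receives the very same, stateful instance and no pickling or re-instantiation takes place; a stdlib-only sketch (with a plain class standing in for the filesystem):

```python
from multiprocessing.pool import ThreadPool


class MyClass:
    # Plain class standing in for the filesystem from the example above
    def __init__(self):
        self.store: dict[str, str] = {}


my_class = MyClass()
my_class.store["added"] = "content"


def return_store(obj: MyClass) -> tuple[bool, dict]:
    # In a thread, obj is the very same instance as in the parent
    return obj is my_class, obj.store


with ThreadPool(processes=2) as pool:
    results = pool.map(return_store, [my_class, my_class])

print(results)  # [(True, {'added': 'content'}), (True, {'added': 'content'})]
```

Keep in mind that threads only help with I/O-bound workloads; CPU-bound code still serializes on the GIL.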

Instantiate the filesystem inside the child process itself

Instead of passing the instantiated filesystem directly into the subprocess, it is better to use an adapter that stores the information about the filesystem and can create an instance of it on demand. Let us consider the following example:

import multiprocessing
from fsspec.implementations.github import GithubFileSystem

multiprocessing.set_start_method("fork")


class GithubFileSystemAdapter:
    def __init__(self, org: str, repo: str):
        self.org = org
        self.repo = repo

    def get_github_fs(self) -> GithubFileSystem:
        return GithubFileSystem(org=self.org, repo=self.repo)


def list_root_directory_of_repository(
    github_fs_adapter: GithubFileSystemAdapter,
) -> list[str]:
    # Create fsspec filesystem inside child process:
    github_fs = github_fs_adapter.get_github_fs()
    return github_fs.ls("/")


## Create GithubFileSystemAdapter for 'https://github.com/python/cpython':
github_fs_adapter = GithubFileSystemAdapter(org="python", repo="cpython")


## Run the function in a child process:
with multiprocessing.Pool(processes=1) as pool:
    messages = pool.map(
        list_root_directory_of_repository,
        [github_fs_adapter],
    )

## Print out store content from child process:
for message in messages:
    print(message)

As can be seen, we define a container class GithubFileSystemAdapter that simply stores the construction parameters and implements a method for creating the filesystem. This container class is passed to the child process, within which the fsspec filesystem object is created. It can then be used safely, without the problems seen in the previous example.
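The same adapter idea applies to any resource that cannot cross a process boundary. A minimal stdlib sketch with sqlite3 (the SqliteAdapter name is hypothetical): a live connection fails to pickle, while an adapter holding only the construction parameters round-trips fine:

```python
import pickle
import sqlite3


class SqliteAdapter:
    # Hypothetical adapter: holds only picklable construction parameters
    def __init__(self, path: str):
        self.path = path

    def connect(self) -> sqlite3.Connection:
        # The resource is created inside the receiving process instead
        return sqlite3.connect(self.path)


# A live connection cannot cross a process boundary:
try:
    pickle.dumps(sqlite3.connect(":memory:"))
except TypeError as exc:
    print(f"Not picklable: {exc}")

# The adapter, in contrast, survives the pickling round-trip:
adapter = pickle.loads(pickle.dumps(SqliteAdapter(":memory:")))
print(type(adapter.connect()).__name__)  # Connection
```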