Production Machine Learning Code
In my previous role, we wrote transformations using Spark Structured Streaming in notebooks and scheduled them in Airflow using the Papermill operator. We lacked the internal expertise and therefore didn’t see any issues with this at the time. We aren’t the only ones to make this blunder. Fortunately, the industry is becoming increasingly vocal about the fact that putting notebooks into production is bad practice.
You may be asking yourself: a) is it really all that bad, and b) what are the alternatives? Let’s start with the first point, that is, why we should avoid running notebooks in production.
No Version Control
Notebooks do not have a proper versioning mechanism. An .ipynb file is simply whatever happened to be saved last, whereas something like a Python .whl package is built and released with an explicit version number.
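For comparison, here is a minimal sketch of how a packaged project carries an explicit version. The package name my_transforms and everything else in the file are hypothetical; it simply illustrates the standard setuptools convention.

```python
# setup.py -- hypothetical packaging config for code extracted from notebooks
from setuptools import setup, find_packages

setup(
    name="my_transforms",      # hypothetical package name
    version="1.2.3",           # explicit, reviewable version number
    packages=find_packages(),
)
```

Building this produces a wheel named something like my_transforms-1.2.3-py3-none-any.whl, so it is always unambiguous exactly which version is deployed; a notebook file carries no equivalent marker.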
No Unit Tests
First and foremost, Jupyter Notebooks do not natively support exporting specific functions. Even if you were to somehow import the functions into a separate test file, if that test file were a notebook itself, testing frameworks like pytest wouldn’t be able to discover and run the tests.
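Once the logic lives in a plain Python module, testing becomes straightforward. The sketch below is hypothetical: a transforms.py holding a function that previously lived in a notebook cell, and a test_transforms.py that pytest discovers and runs automatically.

```python
# transforms.py -- hypothetical module extracted from a notebook cell
def add_total(row):
    """Return the row with an added 'total' field (price * quantity)."""
    return {**row, "total": row["price"] * row["quantity"]}


# test_transforms.py -- pytest collects any test_*.py file and test_* function
from transforms import add_total

def test_add_total():
    row = {"price": 2.0, "quantity": 3}
    assert add_total(row)["total"] == 6.0
```

Running pytest from the project root collects and executes the test; there is no equivalent while the function only exists inside an .ipynb cell.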
No Linters
Jupyter Notebooks are not supported out of the box by the mainstream Python linters and formatters such as pylint, flake8 and black.
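To see what that costs you, here is a hypothetical module containing the kind of mistakes flake8 flags immediately in a .py file but that would go unnoticed inside a notebook cell.

```python
# example.py -- flake8 reports F401 (unused import) and F821 (undefined name);
# the same code pasted into a notebook cell gets no such feedback.
import os  # F401: 'os' imported but unused


def total_amount(rows):
    # F821: 'rowz' is undefined (a typo for 'rows')
    return sum(row["amount"] for row in rowz)
```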
No CI/CD
If you create a pull request containing a notebook, it’s difficult to review or leave inline comments because the diff is the notebook’s raw JSON rather than just the code. Some tools can’t even render the notebook in the UI (at the time of this writing, Azure DevOps does not).
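To see why the diff is so noisy, open an .ipynb file as what it really is: JSON with source, outputs and metadata interleaved. A quick, hypothetical inspection script (assuming a notebook called analysis.ipynb):

```python
# Peek inside a notebook file: every cell is a JSON object mixing code,
# execution counts, outputs and metadata -- all of which show up in a git diff.
import json

with open("analysis.ipynb") as f:   # hypothetical notebook file
    nb = json.load(f)

for cell in nb["cells"]:
    print(cell["cell_type"], cell.get("execution_count"))
    print("".join(cell["source"]))
```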
No Dependency Management
By default, Jupyter Notebooks are not standalone. In other words, if one of your colleagues wants to run the notebook, they would need a requirements.txt, Pipfile.lock or some other manifest file in order to re-create the environment that was previously used to run the notebook. This is in contrast to a pip package, which declares its dependencies and installs them automatically.
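Continuing the hypothetical setup.py sketch from earlier, a package declares its dependencies once, in install_requires, and pip resolves them at install time:

```python
# setup.py -- the hypothetical package declares its own dependencies, so
# `pip install my_transforms` pulls them in automatically.
from setuptools import setup, find_packages

setup(
    name="my_transforms",
    version="1.2.3",
    packages=find_packages(),
    install_requires=[
        "pandas>=1.5",    # illustrative pins; use whatever the code needs
        "pyarrow>=10.0",
    ],
)
```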
Caching
In a Jupyter notebook, once data is loaded, it remains in RAM until the kernel is shut down. This can lead you to think that you need bigger machines to hold all the data in memory at once, when in reality only a subset is being used at any given point in time.
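In a script, the usual workaround is to stream the data rather than hold it all in memory. A minimal sketch using pandas, with a hypothetical events.csv and amount column:

```python
# Process a large CSV in fixed-size chunks so memory usage stays bounded,
# instead of keeping the whole dataset in RAM as a notebook session would.
import pandas as pd

total = 0.0
for chunk in pd.read_csv("events.csv", chunksize=100_000):  # hypothetical file
    total += chunk["amount"].sum()                          # hypothetical column

print(f"Total amount: {total}")
```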
State-Dependent Execution
If you manually run the cells of a notebook in a production environment, you can end up with different results depending on the order in which the cells are executed, because each run builds on whatever state the kernel has accumulated.
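The problem is easiest to see with a tiny, hypothetical pair of cells:

```python
# Cell 1: load the data
prices = [10, 20, 30]

# Cell 2: apply a 10% discount "in place"
prices = [p * 0.9 for p in prices]
print(prices)
```

Run cell 2 once and you get a single 10% discount; run it again and the discount is applied twice. The result depends on how many times and in what order the cells were executed, not just on the code itself, whereas a script run top to bottom always produces the same answer.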