Contributions

Article
CSV, JSON, Parquet—which file format should you use for data being processed by Pandas?
Article
You’re on a new version of Linux, you try a pip install, and it errors out, talking about “externally managed environments” and “PEP 668”. What’s going on? How do you solve this?
Article
Ruff is a new, much faster linter for Python, to help you catch bugs without waiting forever for CI.
Article
Initial and exploratory data analysis have different requirements than production data processing; Polars supports both.
Article
While multiprocessing allows Python to scale to multiple CPUs, it has some performance overhead compared to threading.
Article
Estimating Pandas memory usage from the data file size is surprisingly difficult. Learn why, and some alternative approaches that don’t require estimation.
Article
Switching from float64 (double-precision) to float32 (single-precision) can cut memory usage in half. But how do you deal with data that doesn’t fit?
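As a rough illustration of the claim above, the following sketch (assuming NumPy is installed; the array contents and variable names are made up for the example) compares the memory used by the same values stored as float64 and as float32, and shows one way data can fail to "fit" in single precision.

```python
import numpy as np

# Hypothetical data: one million values stored in double precision.
values64 = np.arange(1_000_000, dtype=np.float64)

# The same values converted to single precision.
values32 = values64.astype(np.float32)

# float64 uses 8 bytes per element, float32 uses 4.
print("float64 bytes:", values64.nbytes)
print("float32 bytes:", values32.nbytes)

# float32 has a 24-bit significand, so not every integer above 2**24
# can be represented exactly; 2**24 + 1 is rounded to a neighbor.
big = np.float32(16_777_217)
print("2**24 + 1 as float32:", float(big))
```

Whether the lost precision matters depends on the data, so this is only a starting point for checking your own values.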
Article
If you need to speed up Python, Cython is a very useful tool. It lets you seamlessly merge Python syntax with calls into C or C++ code, making it easy to write high-performance extensions with rich Python interfaces.

That being said, Cython is not the best tool in all circumstances. So in this article I’ll go over some of the limitations and problems with Cython, and suggest some alternatives.
Article
While Polars is mostly known for running faster than Pandas, if you use it right it can sometimes also significantly reduce memory usage compared to Pandas. In particular, certain techniques that you need to do manually in Pandas can be done automatically in Polars, allowing you to process large datasets without using as much memory—and with less work on your side!
Article
Python 3.7 end of life is in 6 months; after that there will be no more security updates. So the time to upgrade is now.
Article
The libraries you’re using might be running more threads than you realize—and that can mean slower execution.
Article
Python 3.11 is out now–but should you switch to it immediately? And if you shouldn’t upgrade just yet, when should you?
Article
Your data processing jobs are fast… most of the time. Next, find the slow runs so you can speed them up.
Article
Learn a variety of—sometimes horrible—ways to instrument and measure performance in Python.
Article
Vectorization is a great way to speed up your Python code, but you’re limited to specific operations on bulk data. Learn how to get past these limitations.
Article
Learn how to speed up your Celery tasks by identifying slow tasks, and then finding the performance bottleneck using a profiler.
Article
Vectorization in Pandas can make your code faster—except when it will make your code slower.
Article
Installing packages with pip, Poetry, and Pipenv can be slow. Learn how to ensure it’s not even slower, and a potential speed-up.
Article
msgspec is a schema-based JSON encoder/decoder, which allows you to process large files with lower memory and CPU usage.
Article
Python’s Global Interpreter Lock (GIL) stops threads from running Python code in parallel. Learn how to determine the impact of the GIL on your code.
Tutorial
Learn how to read CSVs in Pandas much faster.
Article
Vectorization allows you to speed up processing of homogeneous data in Python. Learn what it means, when it applies, and how to do it.
Article
Python 3.6 will stop getting security updates in December 2021. Given the existence of 3.7, 3.8, 3.9, and 3.10, you really should upgrade.
Article
Conda installs are very slow, but you can speed them up with a much faster Conda reimplementation called Mamba.
Article
You can write Python extensions with Cython, Rust, and many other tools. Learn which one you should use, depending on your particular needs.
Article
Python has two packaging systems, pip and Conda. Learn the differences between them so you can pick the right one for you.
Article
Python 3.10 is out now, but you won't be able to switch for a while: there are missing packages, missing toolchain updates, and more.
Article
Learn how to scan your Conda package dependencies for security vulnerabilities.
Tutorial
NumPy provides memory views transparently, as a way to save memory. But you need to understand how they work, because if you’re not careful you can also leak memory, or even modify data in ways you didn’t expect.
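A minimal sketch of the behavior described above, assuming NumPy is installed (the array names are illustrative only): basic slicing returns a view that shares memory with the original array, so writing to the view changes the original, while an explicit copy does not.

```python
import numpy as np

original = np.zeros(10, dtype=np.int64)

# Basic slicing returns a view onto the same memory, not a copy.
view = original[2:5]
print(np.shares_memory(original, view))

# Writing through the view modifies the original array too.
view[:] = 7
print(original)

# An explicit copy owns its own memory, so changes stay local.
independent = original[2:5].copy()
independent[:] = 0
print(original)

# A view keeps a reference to its base array, which is why a small
# view can keep a much larger array alive in memory.
print(view.base is original)
```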
Article
When you’re loading many strings into Pandas, you’re going to use a lot of memory. If you have only a limited number of strings, you can save memory with categoricals, but that’s only helpful in a limited number of situations.

With Pandas 1.3, there’s a new option that can save memory on a large number of strings as well, simply by changing to a new column type.
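The categorical case mentioned above can be sketched as follows, assuming Pandas is installed; the column contents are invented for the example, and the sketch covers only categoricals, not the newer column type. With only three distinct strings repeated many times, the categorical column stores each string once plus a small integer code per row.

```python
import pandas as pd

# Hypothetical column: 300,000 rows but only three distinct strings.
colors = pd.Series(["red", "green", "blue"] * 100_000)

# Convert to a categorical: unique strings are stored once,
# and each row holds a small integer code instead.
as_category = colors.astype("category")

# deep=True includes the memory used by the string data itself.
string_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("strings:    ", string_bytes)
print("categorical:", category_bytes)
```

The saving shrinks as the number of distinct strings grows, which is the limitation the summary above refers to.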
