Description
If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:
1. So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
2. In order to know where to focus memory optimization efforts.
dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV.
dask-memusage alternatives and similar packages
Based on the "Science and Data Analysis" category.
- Pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
- NumPy: The fundamental package for scientific computing with Python.
- statsmodels: Statistical modeling and econometrics in Python.
- Biopython: Official git repository for Biopython (originally converted from CVS).
- Interactive Parallel Computing with IPython: IPython Parallel, interactive parallel computing in Python.
- Cubes: [NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis.
- Fugue: A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
- bcbio-nextgen: Validated, scalable, community-developed variant calling, RNA-seq and small RNA analysis.
- bccb: Incubator for useful bioinformatics code, primarily in Python and R.
- NeuPy: A TensorFlow-based Python library for prototyping and building neural networks.
- signac: Manage large and heterogeneous data spaces on the file system.
- PatZilla: A modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.
- Kotori: A flexible data historian based on InfluxDB, Grafana, MQTT, and more. Free, open, simple.
- cclib: A library for parsing and interpreting the results of computational chemistry packages.
- ElasticBatch: Elasticsearch tool for easily collecting and batch inserting Python data and pandas DataFrames.
- Open Babel: A chemical toolbox designed to speak the many languages of chemical data.
README
dask-memusage
If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:
- So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
- In order to know where to focus memory optimization efforts.
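For example, if the recorded peak for your most memory-hungry task is around 2GB and a machine has 16GB of RAM, then roughly eight single-threaded worker processes on that machine is a sensible upper bound (these numbers are purely illustrative).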
dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV:
task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
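Because the output is plain CSV, it is easy to inspect with pandas. A minimal sketch (the file path is illustrative):
import pandas as pd

df = pd.read_csv("/tmp/memusage.csv")
# Show the tasks with the highest peak memory usage:
print(df.sort_values("max_memory_mb", ascending=False).head(10))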
Usage
Important: Make sure your workers only have a single thread! Otherwise the results will be wrong.
Installation
On the machine where you are running the Distributed scheduler, run:
$ pip install dask_memusage
Or if you're using Conda:
$ conda install -c conda-forge dask-memusage
API usage
# Add to your Scheduler object, which is e.g. your LocalCluster's scheduler
# attribute:
from dask_memusage import install
install(scheduler, "/tmp/memusage.csv")
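For example, with a LocalCluster the full setup might look roughly like this (a minimal sketch; the worker count and CSV path are illustrative):
from dask.distributed import Client, LocalCluster
from dask_memusage import install

# Workers must be single-threaded for the measurements to be meaningful:
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
install(cluster.scheduler, "/tmp/memusage.csv")

client = Client(cluster)
# ... run your computation; per-task peak memory is written to /tmp/memusage.csv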
CLI usage
$ dask-scheduler --preload dask_memusage --memusage-csv /tmp/memusage.csv
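If you start the workers yourself, remember to give each one a single thread, for example (the scheduler address is illustrative):
$ dask-worker --nthreads 1 tcp://scheduler-address:8786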
Limitations
- Again, make sure you only have one thread per worker process.
- This is statistical profiling, running every 10ms. Tasks that take less than that won't have accurate information.
Help
Need help? File a ticket at https://github.com/itamarst/dask-memusage/issues/new
*Note that all licence references and agreements mentioned in the dask-memusage README section above
are relevant to that project's source code only.