Description
If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:
1. So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
2. In order to know where to focus memory optimization efforts.
dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV.
dask-memusage alternatives and similar packages
Based on the "Science and Data Analysis" category.
- Pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
- NumPy: The fundamental package for scientific computing with Python.
- statsmodels: Statistical modeling and econometrics in Python.
- Biopython: Official git repository for Biopython (originally converted from CVS).
- Interactive Parallel Computing with IPython: IPython Parallel, interactive parallel computing in Python.
- Cubes: [NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis.
- Fugue: A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
- bcbio-nextgen: Validated, scalable, community-developed variant calling, RNA-seq and small RNA analysis.
- bccb: Incubator for useful bioinformatics code, primarily in Python and R.
- NeuPy: A TensorFlow-based Python library for prototyping and building neural networks.
- signac: Manage large and heterogeneous data spaces on the file system.
- PatZilla: A modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.
- Kotori: A flexible data historian based on InfluxDB, Grafana, MQTT, and more. Free, open, simple.
- cclib: A library for parsing and interpreting the results of computational chemistry packages.
- ElasticBatch: Elasticsearch tool for easily collecting and batch inserting Python data and pandas DataFrames.
- Open Babel: A chemical toolbox designed to speak the many languages of chemical data.
README
dask-memusage
If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:
- So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
- In order to know where to focus memory optimization efforts.
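For example, if the recorded peak for your most memory-hungry task is around 2GB and a machine has 16GB of RAM, then roughly eight single-threaded worker processes on that machine is a sensible upper bound (these numbers are purely illustrative).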
dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV:
task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
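Because the output is plain CSV, it is easy to inspect with pandas. A minimal sketch (the file path is illustrative):
import pandas as pd

df = pd.read_csv("/tmp/memusage.csv")
# Show the tasks with the highest peak memory usage:
print(df.sort_values("max_memory_mb", ascending=False).head(10))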
Usage
Important: Make sure your workers only have a single thread! Otherwise the results will be wrong.
Installation
On the machine where you are running the Distributed scheduler, run:
$ pip install dask_memusage
Or if you're using Conda:
$ conda install -c conda-forge dask-memusage
API usage
# Add to your Scheduler object, which is e.g. your LocalCluster's scheduler
# attribute:
from dask_memusage import install
install(scheduler, "/tmp/memusage.csv")
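For example, with a LocalCluster the full setup might look roughly like this (a minimal sketch; the worker count and CSV path are illustrative):
from dask.distributed import Client, LocalCluster
from dask_memusage import install

# Workers must be single-threaded for the measurements to be meaningful:
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
install(cluster.scheduler, "/tmp/memusage.csv")

client = Client(cluster)
# ... run your computation; per-task peak memory is written to /tmp/memusage.csv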
CLI usage
$ dask-scheduler --preload dask_memusage --memusage-csv /tmp/memusage.csv
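If you start the workers yourself, remember to give each one a single thread, for example (the scheduler address is illustrative):
$ dask-worker --nthreads 1 tcp://scheduler-address:8786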
Limitations
- Again, make sure you only have one thread per worker process.
- This is statistical profiling, running every 10ms. Tasks that take less than that won't have accurate information.
Help
Need help? File a ticket at https://github.com/itamarst/dask-memusage/issues/new
*Note that all licence references and agreements mentioned in the dask-memusage README section above
are relevant to that project's source code only.