Changelog History
v1.7.1 Changes

December 10, 2020

PyTorch 1.7.1 Release Notes
- New Features
- Critical Fixes
- Other Fixes

New Features
- Add Python 3.9 binaries for Linux and macOS (#48133) and Windows (#48218).
  NOTE: Conda installs for Python 3.9 will require the conda-forge channel, for example:
  conda install -y -c pytorch -c conda-forge pytorch
- Upgrade CUDA binaries to use cuDNN 8.0.5 (builder repo #571).
  This upgrade fixes regressions on Ampere cards introduced in cuDNN 8.0.4. It will improve performance for RTX 3090 cards, and may improve performance on other RTX 30-series cards.

Critical Fixes
Python 3.9
- Use custom version of pybind11 to work around Python 3.9 issues (#48312)
- Fix jit Python 3.9 parsing (#48744)
- Fix cpp_extension to work with Python 3.9 (#48768)
Build
- Fix cpp_extension to properly handle env variable on Windows (#48937)
- Properly package libomp.dylib for macOS binaries (#48337)
- Fix build for statically linked OpenBLAS on aarch64 (#48819)
Misc
- torch.sqrt: fix wrong output values for very large complex input (#48216)
- max_pool1d: fix for discontiguous inputs (#48219)
- collect_env: fix detection of DEBUG flag (#48319)
- collect_env: fix to work when PyTorch is not installed (#48311)
- Fix amp memory usage when running in no_grad() mode (#48936)
- nn.ParameterList and nn.ParameterDict: remove spurious warnings (#48215)
- Tensor Expression fuser bugfixes (#48137)
Other Fixes
- v1.7.1-rc3 (December 07, 2020)
- v1.7.1-rc2 (December 03, 2020)
- v1.7.1-rc1 (November 21, 2020)

v1.7.0 Changes

October 27, 2020

PyTorch 1.7.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
Highlights
The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to stable including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper.
A few of the highlights include:
- CUDA 11 is now officially supported with binaries available at PyTorch.org
- Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler
- (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft
- (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format
- (Prototype) Distributed training on Windows now supported

To reiterate, starting with PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement here. Note that the prototype features listed in this blog are available as part of this release.
Front End APIs
[Beta] NumPy Compatible torch.fft module
FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy.
This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function.
Example usage:
```
>>> import torch.fft
>>> t = torch.arange(4)
>>> t
tensor([0, 1, 2, 3])
>>> torch.fft.fft(t)
tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j])
>>> t = torch.tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j])
>>> torch.fft.fft(t)
tensor([12.+16.j, -8.+0.j, -4.-4.j,  0.-8.j])
```
- Documentation | Link
[Beta] C++ Support for Transformer NN Modules
Since PyTorch 1.5, we've continued to maintain parity between the Python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ frontend. Moreover, developers no longer need to save a module from Python/JIT and load it into C++, as it can now be used in C++ directly.
- Documentation | Link
[Beta] torch.set_deterministic
Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the torch.set_deterministic(bool) function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false, and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically by default.
More precisely, when this flag is true:
- Operations known to not have a deterministic implementation throw a runtime error;
- Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and
- torch.backends.cudnn.deterministic = True is set.
Note that this is necessary, but not sufficient, for determinism within a single run of a PyTorch program. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior.
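A minimal sketch of opting in (torch.is_deterministic, listed under New features below, queries the flag):

```
import torch

# Direct PyTorch to use deterministic algorithms where available, and to
# raise a RuntimeError for operations known to be nondeterministic.
torch.set_deterministic(True)
print(torch.is_deterministic())  # True
```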
See the documentation for torch.set_deterministic(bool) for the list of affected operations.

Performance & Profiling
[Beta] Stack traces added to profiler
Users can now see not only the operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability: use the autograd profiler as before, but with the optional new parameters with_stack and group_by_stack_n. Caution: regular profiling runs should not use this feature, as it adds significant overhead.
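A minimal sketch of the two new parameters:

```
import torch
from torch.autograd import profiler

x = torch.randn(16, 16, requires_grad=True)

# with_stack records where each operator was called from;
# group_by_stack_n groups the averaged results by the top-n stack frames.
with profiler.profile(with_stack=True) as prof:
    (x @ x).sum().backward()

print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total"))
```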
Distributed Training & RPC
[Stable] TorchElastic now bundled into PyTorch docker image
TorchElastic offers a strict superset of the current torch.distributed.launch CLI with added features for fault-tolerance and elasticity. If the user is not interested in fault-tolerance, they can get exact functionality/behavior parity by setting max_restarts=0, with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).
By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with TorchElastic right away without having to separately install torchelastic. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow distributed PyTorch operators.
- Usage examples and how to get started | Link
[Beta] Support for uneven dataset inputs in DDP
PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset sizes across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different processes. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.
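A minimal sketch using the join() context manager this feature shipped with (assumes torch.distributed.init_process_group(...) has already run on every rank, and that `dataloader` is a hypothetical per-rank loader that may yield different numbers of batches on different ranks):

```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

model = DistributedDataParallel(nn.Linear(10, 10))
# Ranks that exhaust their data early shadow the collectives of the
# ranks still training, avoiding hangs at the end of the epoch.
with model.join():
    for inp in dataloader:
        model(inp).sum().backward()
```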
[Beta] NCCL Reliability - Async Error/Timeout Handling
π In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before).
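A sketch of opting in; the variable name NCCL_ASYNC_ERROR_HANDLING is an assumption based on the 1.7-era documentation, since the notes above do not name it:

```
import os

# Assumed gating variable; must be set before the NCCL process group
# is created.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```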
[Beta] TorchScript remote and rpc_sync
torch.distributed.rpc.rpc_async has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality is extended to the remaining two core RPC APIs, torch.distributed.rpc.rpc_sync and torch.distributed.rpc.remote. This completes the major RPC APIs targeted for support in TorchScript; it allows users to use the existing Python RPC APIs within TorchScript (in a script function or script method, which releases the Python Global Interpreter Lock) and could possibly improve application performance in multithreaded environments.

[Beta] Distributed optimizer with TorchScript support
PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the Python API. However, users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel) or any RPC-based training application. Users couldn't do this with the distributed optimizer before because of the Python Global Interpreter Lock (GIL) limitation.
In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before, but it automatically converts optimizers within each worker into TorchScript to make each one GIL-free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading.
Currently, the only optimizer that supports automatic conversion with TorchScript is Adagrad; all other optimizers will still work as before, without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers and expect more to come in future releases. The usage to enable TorchScript support is automatic and exactly the same as with existing Python APIs. Here is an example of how to use this:

```
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()

    # Backward pass.
    dist_autograd.backward(context_id, [loss.sum()])

    # Optimizer: pass in optim.Adagrad; DistributedOptimizer will
    # automatically convert/compile it to TorchScript (GIL-free).
    dist_optim = DistributedOptimizer(
        optim.Adagrad,
        [rref1, rref2],
        lr=0.05,
    )
    dist_optim.step(context_id)
```
[Beta] Enhancements to RPC-based Profiling
Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6. In PyTorch 1.7, the following enhancements have been made:
- Implemented better support for profiling TorchScript functions over RPC
- Achieved parity in terms of profiler features that work with RPC
- Added support for asynchronous RPC functions on the server-side (functions decorated with rpc.functions.async_execution)
Users are now able to use familiar profiling tools such as with torch.autograd.profiler.profile() and with torch.autograd.profiler.record_function, and this works transparently with the RPC framework with full feature support, profiling asynchronous functions and TorchScript functions.

[Prototype] Windows support for Distributed Training
PyTorch 1.7 brings prototype support for DistributedDataParallel and collective communications on the Windows platform. In this release, the support only covers Gloo-based ProcessGroup and FileStore.
To use this feature across multiple machines, please provide a file from a shared file system in init_process_group.
```
# initialize the process group
dist.init_process_group(
    "gloo",
    # multi-machine example:
    # Shared files need six "/"
    # init_method = "file://////{machine}/{share_folder}/file"
    # Local file needs three "/"
    init_method="file:///{your local file path}",
    rank=rank,
    world_size=world_size
)

model = DistributedDataParallel(local_model, device_ids=[rank])
```
- Design doc | Link
- Documentation | Link
- Acknowledgement | gunandrose4u
Mobile
PyTorch Mobile supports both iOS and Android with binary packages available in CocoaPods and JCenter respectively. You can learn more about PyTorch Mobile here.
[Beta] PyTorch Mobile Caching allocator for performance improvements
On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults, as PyTorch, being a functional framework, does not maintain state for the operators: outputs are allocated dynamically on each execution of the op for most ops. To ameliorate the resulting performance penalties, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor size and is currently available only via the PyTorch C++ API. The caching allocator itself is owned by the client, and thus the lifetime of the allocator is also maintained by client code. Such a client-owned caching allocator can then be used with a scoped guard, c10::WithCPUCachingAllocatorGuard, to enable the use of cached allocation within that scope.
Example usage:
```
#include <c10/mobile/CPUCachingAllocator.h>
.....
// Owned by client code. Can be a member of some client class so as to tie
// the lifetime of the caching allocator to that of the class.
c10::CPUCachingAllocator caching_allocator;
.....
{
  c10::optional<c10::WithCPUCachingAllocatorGuard> caching_allocator_guard;
  if (FLAGS_use_caching_allocator) {
    caching_allocator_guard.emplace(&caching_allocator);
  }
  ....
  model.forward(..);
}
.....
```
NOTE: The caching allocator is only available on mobile builds; using it outside of mobile builds won't be effective.
Backwards Incompatible changes
Python API
torch.conj now returns the input as-is for real Tensors (#43270)
Previously, torch.conj and Tensor.conj were making a clone for Tensors of real dtype. It now returns the Tensor as-is to improve performance.
You can recover the original behavior by adding a .clone() for real Tensors.
Note that this behavior is different from numpy, for which np.conj returns a new ndarray and ndarray.conj returns the ndarray as-is.

1.6.0:

```
>>> t.is_complex()
False
>>> t.conj() is t
False
```

1.7.0:

```
>>> t.is_complex()
False
>>> t.conj() is t
True
>>> t.conj().clone() is t
False
```
torch.tensor, torch.as_tensor, and torch.sparse_coo_tensor now use the input Tensor's device when it is not specified (#41984)
This changes the device on which the Tensor is created, so the user may start seeing device mismatch errors.
It also means for sparse Tensors that both of the provided Tensors must be on the same device if the device is not specified.
You can recover the original behavior by passing the device argument.

1.6.0:

```
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
...     torch.tensor(([0], [2]), device="cpu"),
...     torch.tensor(([1.],), device="cuda"),
...     size=(3, 3, 1)).device
device(type='cuda', index=0)
```

1.7.0:

```
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cuda:0')
>>> # Specify the device to get the same behavior as 1.6
>>> torch.tensor(t, dtype=torch.float32, device='cpu').device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
...     torch.tensor(([0], [2]), device="cpu"),
...     torch.tensor(([1.],), device="cuda"),
...     size=(3, 3, 1)).device
RuntimeError: backend of indices (CPU) must match backend of values (CUDA)
>>> # Specify the device to get the same behavior as 1.6
>>> torch.sparse_coo_tensor(
...     torch.tensor(([0], [2]), device="cpu"),
...     torch.tensor(([1.],), device="cuda"),
...     size=(3, 3, 1), device="cuda:0").device
device(type='cuda', index=0)
```
torch.nn.utils.rnn.pack_padded_sequence: remove hidden cross-device copy for lengths (#41984)
In previous versions, when the lengths argument was a CUDA tensor, it would incorrectly be moved to the CPU silently.
This could lead to surprising performance penalties and CPU/GPU synchronization when using CUDA, so the implicit move has been removed.
You need to make sure that the provided lengths is a CPU Tensor when it is provided as a Tensor.

1.6.0:

```
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Implicitly moves lengths to the CPU and runs fine
```

1.7.0:

```
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor
>>> # Ensure lengths is already on the right device
>>> lengths = lengths.cpu()
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Runs fine with no implicit move across devices
```
Improve torch.norm handling of keepdim=True (#41956)
Before this change, when calling torch.norm with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. It is now properly respected.
Also, any time torch.norm was called with p='nuc' and keepdim=True, the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. It now properly keeps all the dimensions.
You can recover the original behavior by setting keepdim=False.
NOTE: this function is now deprecated (see below) and we recommend you use torch.linalg.norm, which follows NumPy's conventions.

1.6.0:

```
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1])
```

1.7.0:

```
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1, 1])
```
torch.split and torch.chunk: fix view tracking for the autograd (#41567)
The autograd system is able to correctly handle modifications through views of Tensors by explicitly tracking known view operations. In prior releases, torch.split and torch.chunk were not marked as known view operations, which could lead to silently wrong gradients.
Note that since v1.5, inplace modification of views created by functions that return multiple views is deprecated. Such a case is not properly handled by the autograd and can lead to internal errors or wrong gradients. So, as a side effect of this view fix, inplace modifications of the outputs of torch.split and torch.chunk will now raise a warning and can lead to internal errors or wrong gradients, while they were previously silently computing wrong gradients.
If you see such a warning, you should replace the inplace operation with an out-of-place one.
You can recover the original behavior by using the new torch.unsafe_split and torch.unsafe_chunk. Note that these functions are only here to ease the transition and will also be removed in a future version.

torch.{argmin,argmax} now always return the first min/max index (#42004)
torch.argmin (torch.argmax) now always returns the index of the first minimum (maximum) element. This choice is consistent with NumPy. Previously, if there were multiple minima (maxima), the index returned could be the index of any of them.
You cannot recover the original behavior, as it was platform dependent and not guaranteed. If your code was relying on a specific index for your specific platform, you should update it to work with the first index, and this new code will work on all platforms.
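A small illustration of the new first-index guarantee:

```
>>> import torch
>>> t = torch.tensor([1, 0, 1, 0])
>>> torch.argmax(t)  # two maxima (indices 0 and 2): the first wins
tensor(0)
>>> torch.argmin(t)  # two minima (indices 1 and 3): the first wins
tensor(1)
```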
torch.{min,max,median}: update backward formula when doing full reduction (dim argument not provided) (#43519)
When no dimension is specified, a full reduction is performed and the gradient will now flow back evenly towards all the inputs that realized the output value. The old behavior was to propagate the gradient only to one such input, selected arbitrarily.
This should improve the stability of training by gradient descent.
To recover the previous behavior, you can perform the reduction with the dim= argument. It will ensure that the gradient only flows back for the input whose index was returned.

1.6.0:

```
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0, 0, 1])
```

1.7.0:

```
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0.5, 0, 0.5])
>>> a.max(dim=0).max(dim=0).max(dim=0).backward()
>>> a.grad
tensor([0, 0, 1])
```
nn.BCELoss size mismatch warning is now an error (#41426)
This is the end of the deprecation cycle for this op, to make sure it does not have different broadcasting semantics compared to numpy's broadcasting semantics used everywhere else in PyTorch's codebase.
You need to make sure all inputs are the same size to avoid the error.

1.6.0:

```
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
UserWarning: Using a target size (torch.Size([25, 1])) that is different to the input size (torch.Size([25])) is deprecated. Please ensure they have the same size.
tensor(1.0604)
```

1.7.0:

```
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
ValueError: Using a target size (torch.Size([25, 1])) that is different to the input size (torch.Size([25])) is deprecated. Please ensure they have the same size.
>>> b = b.reshape(25)
>>> bceloss(a, b)
tensor(1.0604)
```
Custom autograd.Function stops materializing None output Tensors (#41490)
To improve performance, a custom autograd.Function will not create a Tensor full of zeros when an input is differentiable but the user's backward function returns None for it. This means that code for which the .backward() or autograd.grad() final result will now be None, while it used to be a Tensor full of zeros.
You can recover the previous behavior by having your custom autograd.Function materialize the zero Tensor with torch.zeros_like(input) to replace the None output for the backward method.

```
import torch

# Custom Function that returns None for the gradient
class GetTwos(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp.clone().fill_(2)

    @staticmethod
    def backward(ctx, grad_out):
        # To recover the 1.6 behavior, replace the line below with
        # `return torch.zeros_like(grad_out)`
        return None

a = torch.rand(10, requires_grad=True)
b = GetTwos.apply(a)
b.sum().backward()

print(a.grad)
# In PyTorch 1.6 this will print
# tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# In PyTorch 1.7 this will print
# None
```
Fix inplace detection for non-differentiable outputs (#41269)
We fixed a bug in the inplace detection code that was preventing the detection of some inplace operations on outputs that are not differentiable (like integer type Tensors).
This can lead to code that used to run fine now throwing the error "a Tensor that was needed for backward was modified in an inplace operation".
Such failures are real, and the user code must be fixed to compute proper gradients. In general, this involves cloning the Tensor before modifying it inplace to make sure the backward pass can happen safely.

```
import torch

a = torch.rand(10, requires_grad=True)
with torch.no_grad():
    a[2] = 10
b, ind = a.max(dim=0)
# ind is 2 here

with torch.no_grad():
    t = torch.rand(10)
    t[4] = 10
    res = torch.max(t, dim=0, out=(torch.Tensor(), ind))
    # ind becomes 4 here

# This backward runs in 1.6 but will fail in 1.7
b.sum().backward()
print(a.grad)
# tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
# The value is wrong: it is at index 4 while it should be at index 2.
# The issue is avoided by not modifying ind inplace, by replacing the
# line above with:
# res = torch.max(t, dim=0, out=(torch.Tensor(), ind.clone()))
```
Add __torch_function__ for methods (#37091)
Functions, slicing and Tensor methods will now properly preserve the subclass type when possible.

```
>>> class SubTensor(torch.Tensor):
...     pass
>>> type(torch.add(SubTensor([0]), SubTensor([1]))).__name__
'SubTensor'
>>> type(torch.add(SubTensor([0]), torch.Tensor([1]))).__name__
'SubTensor'
```

The old behavior of "any operation on your subclass produces a torch.Tensor instead of the subclass" can be recovered by doing:

```
from torch._C import _disabled_torch_function_impl

class SubTensor(torch.Tensor):
    __torch_function__ = _disabled_torch_function_impl
```

For all details on how to use this feature, please refer to the doc page for it.
Tensor.__iter__: use torch.unbind instead of a for loop (#40884)
This improves performance significantly, but it changes the behavior of in-place operations on the values returned by the iterator. This happens only if either the input Tensor or any argument of the in-place operation is a Tensor that requires gradients, and it will fail with "Output X of UnbindBackward is a view and is being modified inplace".
You can recover the previous behavior by manually slicing the Tensor: [t[i] for i in range(t.size(0))], as shown in the example below.

1.6.0:

```
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate(x):
>>>     v.fill_(i)
```

1.7.0:

```
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate([x[j] for j in range(x.size(0))]):
>>>     v.fill_(i)
```
Updated most functions that take zero, one or two Tensor arguments, as well as indexing ops, to check for memory overlap in the Tensors being worked on (#43418, #43419, #43420, #43421, #43423, #43422)
This fixes silent correctness errors: something that used to be silently incorrect now errors out. Code that raises this error must be updated to avoid the offending op, which was returning wrong results, as shown in the example below:

```
>>> x = torch.randn(1, 3)
>>> # Create a tensor that has internal memory overlap
>>> y = x.expand(2, 3)
# In 1.6, this would not error out, but in 1.7, this errors out
>>> torch.nn.functional.elu(y, inplace=True)
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.
# Here is the fix in 1.7
>>> torch.nn.functional.elu(y, inplace=False)
```
C++ API: any external users of TensorIterator now always get the memory overlap check. The previous behavior can be recovered by calling set_check_mem_overlap(false) when creating the iterator.

TorchScript
TorchScript now correctly supports various exception types and custom exception messages (#41907)
- Exceptions raised in TorchScript were traditionally replaced with a generic runtime error that carried neither the exception type nor the message, leading to crashes that are difficult to pinpoint and debug. We improved TorchScript to correctly parse exception types and messages and surface them to users.
- This change is backward incompatible because TorchScript now attempts to compile user code that creates custom exception messages instead of ignoring it. Any TorchScript-incompatible Python features used in those code snippets will lead to failures.
- There is no fixed formula for fixing this backward incompatibility failure other than updating the code that generates exceptions to be TorchScript-able (a small sketch of what now compiles follows this list).
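A minimal sketch of a scripted function whose exception type and message are now preserved (function name is illustrative):

```
import torch

@torch.jit.script
def checked_sqrt(x: float) -> float:
    if x < 0.0:
        # Both the ValueError type and the custom message survive scripting.
        raise ValueError("expected a non-negative input, got " + str(x))
    return x ** 0.5
```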
TorchScript now supports properties of TorchScript classes and ScriptModules (#42389, #42390)
- TorchScript added support for @property on TorchScript classes and ScriptModules (see the sketch after this list). Custom setters and getters are also supported. Custom deleters are not supported.
- This improvement is backward incompatible because TorchScript now attempts to script properties of existing classes and Modules. If these properties use Python or PyTorch features that are not supported in TorchScript, scripting will fail.
- There are two ways of fixing backward incompatibility failures introduced by this change: one is using @torch.jit.unused to annotate problematic properties, the other is to update the implementation of the property so that the getter and setter are scriptable.
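A minimal sketch of a module property that is now scripted along with the module (names are illustrative):

```
import torch

class Scaler(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._scale = 2.0

    @property
    def scale(self) -> float:
        return self._scale

    @scale.setter
    def scale(self, value: float):
        self._scale = value

    def forward(self, x):
        return x * self.scale

# The property's getter and setter are compiled with the module.
scripted = torch.jit.script(Scaler())
```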
Quantization
The convolution parameters now support versioning.
- This change means that any quantized convolution module saved using PyTorch 1.7+ cannot be loaded in v1.6 and lower.
- The change is backward compatible: if the model (with conv layers) is saved in version 1.6, it can be safely loaded in version 1.7.
Some undocumented functions that were mistakenly made public have been removed
- torch.absolute_ has been removed; the Tensor method (Tensor.absolute_) should be used instead, just like all other inplace ops.
- torch.ExtraFilesMap is an internal jit construct and should not be used.
TorchScript Compiler Update
In 1.7, we are enabling a Profiling Executor and a new Tensor-Expressions-based (TE) fuser. All compilations will now go through one (an adjustable setting) profiling run and one optimization run. For the profiling run, complete tensor shapes are recorded and used by the new fuser. For the optimization run, the focus is on finding (in torch.jit.ScriptModules) and fusing element-wise operations over CUDA tensors into a single CUDA kernel.
The TE fuser is expected to deliver performance similar to the old fuser used in 1.6. However, it unlocks more opportunities for performance improvements in future releases. In rare cases, performance of some models may degrade 5-10%. If you experience any regressions, please report them on GitHub so we can address them as soon as possible. For 1.7, we are providing an option for our users to revert back to the old fuser by calling torch._C._jit_set_profiling_executor(False) in Python and torch::jit::getExecutorMode() = false; in C++. For more information, please see the "Graph Executor" section in our documentation.
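The Python toggle named above, as a one-line sketch:

```
import torch

# Revert to the 1.6 fuser by disabling the profiling executor.
torch._C._jit_set_profiling_executor(False)
```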
Deprecations
Python API
torch.norm and torch.functional.norm are deprecated in favor of torch.linalg.norm (#44321)
The new torch.linalg.norm has the same behavior as numpy.linalg.norm.
Both deprecated functions had odd behaviors for matrix and vector norms. You should refer to the doc here to find the exact behavior they had and how to replicate it with the new API.
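One way to migrate a simple matrix-norm call (a sketch; check the linked docs for the norms whose behavior differed):

```
import torch

x = torch.randn(4, 4)
old = torch.norm(x, p='fro')            # deprecated spelling
new = torch.linalg.norm(x, ord='fro')   # NumPy-compatible spelling
assert torch.allclose(old, new)
```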
Deprecate fft functions in the torch. namespace in favor of the torch.fft. namespace (#44876)
Please use torch.fft.foo as a drop-in replacement for torch.foo for the following functions: fft, ifft, rfft and irfft.
Warn when some out= functions need to resize an output which is not 0-size (#42079)
This behavior is dangerous and leads to an API that is hard to use. It is being deprecated so that the API can be fixed in future versions.
You should resize the output beforehand to avoid any issue in the future:

```
a = torch.rand(5)
b = torch.rand(25)
# This is deprecated
torch.add(a, a, out=b)
# This has the same behavior but will work in future versions
torch.add(a, a, out=b.resize_(0))
```
torch.optim: warn for duplicate params in param group (#41597)
Providing the same Parameter multiple times in a single param group is most likely due to user error and is being deprecated.
Please open an issue if you have a valid use case that requires this feature.
torch.linspace and torch.logspace: not giving the steps argument is deprecated (#43860)
The default steps argument that has been used historically in PyTorch is not consistent with other libraries and so is being removed to avoid confusion.
For both functions, passing the steps=100 keyword argument can be used to recover the original behavior.

1.6.0:

```
>>> torch.linspace(0, 10).size()
torch.Size([100])
```

1.7.0:

```
>>> torch.linspace(0, 10).size()
UserWarning: Not providing a value for linspace's steps is deprecated and will throw a runtime error in a future release.
torch.Size([100])
>>> torch.linspace(0, 10, steps=100).size()
torch.Size([100])
```

Distributed
- Make TensorPipe the default backend for RPC (#43246)
- Infer RPC backend type to preserve backward compatibility as we make TensorPipe the default (#45065)
- Add deprecation warning to ProcessGroup backend and make TensorPipe backend stable (#45356)
- Add warnings on ProcessGroup and ProcessGroup::Work APIs which will be retired soon (#46366)
New features
Python API
New namespaces:
New operators:
- torch.count_nonzero added (#39992)
- nn.SiLU activation added (#41034)
- torch.logit added (#41062)
- torch.gcd, torch.lcm added (#40651, #41552, #42254)
- torch.functional.atleast_{1d/2d/3d} added (#41317)
- torch.isreal added (#41298)
- nn.Unflatten added (#41564)
- torch.movedim added (#41480)
- torch.isposinf, torch.isneginf added (#41588)
- torch.signbit added (#41589)
- torch.absolute added (#42586)
- torch.clip alias added (#42770)
- torch.quantile added (#42755)
- torch.linalg.det and torch.outer alias added (#42802)
- torch.nansum added (#38628)
- torch.hypot added (#42291)
- torch.nextafter added (#42580)
- torch.hstack, torch.vstack, torch.dstack added (#42799)
- torch.arccosh alias added (#43107)
- Tensor.movedim as a method added (#43122)
- torch.matrix_exp added (#40161)
- torch.fix alias added (#43326)
- torch.arccos, torch.arcsin, torch.arctan aliases added (#43319)
- torch.negative alias added (#43400)
- torch.maximum, torch.minimum added (#42579)
- torch.arctanh, torch.arcsinh aliases added (#43762)
- torch.linalg.norm added (#42749, #43907)
- torch.amax, torch.amin added (#43819)
- torch.heaviside added (#42523)
- torch.i0 added (#43132)
- torch.not_equal, torch.greater, torch.greater_equal, torch.less, torch.less_equal aliases added (#43870)
- torch.exp2 added (#44184)
- torch.kaiser_window added (#44271)
- torch.nanquantile added (#44393)
- torch.multiply, torch.divide aliases added (#44463)
- nn.TripletMarginWithDistanceLoss added (#43680)
- torch.fft.fft, torch.fft.ifft, torch.fft.rfft, torch.fft.irfft, torch.fft.hfft, torch.fft.ihfft added (#43011)
- torch.fft.fftn, torch.fft.ifftn, torch.fft.rfftn, torch.fft.irfftn added (#44550)
- optim.functional.adagrad added (#44715)
- optim.functional.adam added (#44791)
- torch.complex, torch.polar added (#39617)
- Tensor.__complex__ added (#43844)
- torch.vdot added (#43004)
API extension:
- torch.full added support for bool and integer dtypes (#41912)
- torch.lt and torch.masked_select added support for half dtype (#43704)
- torch.div, torch.true_divide, torch.atan2 added support for integer to float type promotion (#42359)
- unflatten added support for non-named dimensions (#42563)
- torch.polygamma added support for n >= 2 (#42499)
- torch.qr added backward support for wide input matrices (#42216)
- nn.Linear for MKLDNN added support for no-bias (#43703)
- torch.lerp added support for half dtype (#43541)
- Updated torch.div to perform true division (end of deprecation cycle) (#42907)
- torch.scatter added support for reductions on CUDA (#41977)
- BFloat16 support for type promotion (#41698, #43324)
- BFloat16 support on CUDA for torch.pow (#44760), unary ops and activations (#44813, #44824, #44834), torch.i0 (#44750), softmax (#44837), div, addcdiv, addcmul, mean, var (#44758), layernorm (#45002), all pooling layers (#44836, #45151), torch.logspace (CPU and CUDA) (#44675), random kernels on Windows (#44918), torch.addmm, torch.addmv (#44986), loss functions (#45011), batched gemm (#45167), nccl path (#38515), binary logical operators (#42485), torch.neg (#45240), Conv (non-cuDNN) (#45007), torch.abs (#44804), torch.erfinv (#43399), comparison ops (#44748)
- torch.asin, torch.neg added support for sparse Tensors (#44028)
- torch.softmax added support for CUDA (#42307)
- Tensor.{real,imag} added setters for these attributes (#39860)
- torch.{addmm,addmv} added support for complex on CUDA (#40431, #43827)
- torch.bmm added support for complex on CPU (#42383)
- torch.{dot, vdot} added support for complex (#42745)
- torch.stft, torch.istft added support for complex (#43886)
- torch.cholesky added support for complex (#44895, #45267)
- torch.sgn added (to support complex) (#39955)
- Binary ops added support for complex (#43174)
- Add allowlist for complex backward (#45461)
Autograd
- Don't automatically materialize output grads with zeros for autograd.Function (#41821)
- Benchmark tool for the autograd.functional API (#43428)
- Added reset_grad API to remove gradients instead of setting them to zero (#44423)
- Allow Tensor-like objects in torch.autograd.gradcheck (#43877)
- Added support for nested calls of the @torch.no_grad() decorator (#44633)
- Added support for torch.lobpcg backward (#43002)
CUDA
- Added TF32 support (#41498)
- CUDA RTX30 series support (#45489, #45130)
  - Note: at the time of the 1.7 release, the currently available and stable Nvidia CUDA libraries are not fully tuned for the RTX 3080 and 3090, so users might see performance regressions.
- torch.cuda.amp.GradScaler now supports sparse gradients (#36786)
- Autocast support for cuDNN RNNs (#42385)
- Support AMP in nn.parallel (#43102)
- Support for TF32 in cuDNN, and a backends.cudnn.allow_tf32 flag to control it (#40737)
- Added torch.cuda.memory.list_gpu_processes to list running processes on a given GPU (#44616)
- Add env variable to bypass CUDACachingAllocator for debugging (#45294)
- Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
C++ API
- nn::TransformerEncoderLayer added (#42633)
- nn::TransformerDecoderLayer added (#42717)
- nn::TransformerEncoder added (#43187)
- nn::TransformerDecoder added (#42886)
- nn::Transformer added (#44333)
- nn::Unflatten added (#42613)
- nn.ParameterList added (#41259)
- torch::cuda::manual_seed and torch::cuda::manual_seed_all added (#42638)
Mobile
- Support Tensor MemoryFormat in java wrappers (#40785)
- Add mobile_optimized boolean flag to optimized model (#45479)
Vulkan
- Backend added (#36491, #43076)
- Added many operators: adaptive_avg_pool2d (#41220), mm (#41221), reshape (#41223), max_pool2d (#41379), add_ and relu_ (#41380), cat (#41434), add and mul (#42674) and avg_pool2d (#42675)
- Model preparation via torch.utils.optimize_for_vulkan (#44903)
- Add to Java API option to load on Vulkan and test app (#44896, #44897)
Distributed
- Support alltoall collective in ProcessGroupGloo (#41424, #41690)
- Add a DDP Communication Hook providing the flexibility to completely override DDP gradient communication (#40848)
- Examples on how to use the DDP communication hook (#43310)
- Add NCCL Alltoall to NCCL process group (#42514)
- Support allgather and gather APIs for Python Objects (#42189)
- Join-based API to support uneven inputs in DDP (#42577)
- broadcast_object API for c10d (#43887)
- Async Error Handling support for ProcessGroupNCCL (#41050, #41051, #41052, #41053, #41054, #44163)
- Add a "gradient_as_bucket_view" parameter to DDP to reduce memory overhead (#44344)
- Add getNumKeys API to c10d TCPStore (#43962)
- Add DeleteKey API for c10d TCPStore (#45401)
Quantization
- New quantized ops:
  - Adaptive average pooling (#40271)
  - Max pooling (#45152)
  - Embedding and EmbeddingBag quantization (8-bit + partial support for 4-bit): (#40076, #41293, #41612, #42924, #42762, #42881, #43077, #43088, #43090, #43176, #43296, #43433, #43989, #44008, #44207, #44208, #44217, #45149, #44845, #44048, #42690, #42612)
  - QNNPACK transposed convolution 2D and 3D (#39714, #40351, #40360, #40370, #40371, #44844, #45078, #45081)
- Operations on quantized tensors:
  - aten::repeat (#40644)
  - aten::append (#40743)
  - stack (#42187)
  - fill_ (#43303)
  - clone for per channel affine quantized tensor (#44573)
  - append (graph mode) (#44641)
  - 1D batch normalization support (#42491)
  - N-Dimensional constant padding (#43304)
  - CELU operator (#39199)
- Support for FP16 quantization (#40708, #40709, #40710, #42147, #42221, #42222, #42348, #41049)
- Add Quantizer support to IValue (#42438)
- Custom module support (#44835)
- Preserving pre and post forward hooks (#37233)
Misc
- torch.set_deterministic and torch.is_deterministic: raise an error when the flag is set and a non-deterministic operation is used (#15359, #41377)
- Add CUDA 11 to nightly binaries (#44086, #43366)
- Dev Tool: nightly checkout tool and doc in CONTRIBUTING.md (#42635, #43294)
- Website: add docs for tagged versions (including rc) on the general website (#45204)
- Build: added BUILD_CAFFE2 flag to be able to disable caffe2 compilation (#43673)
- Dataloader: add prefetch_factor argument to control the number of batches loaded ahead of time (#41130)
- Dataloader: allow handling of np.memmap objects (#39847)
- ROCm: add support for torch.utils.cpp_extension (#41257, #43528)
- ROCm: enable complex BLAS (#43744)
- docker: add torchelastic to docker image (#45438)
- docker: add CUDA 11 support (#45071)
- docker: use Python 3.8 in pytorch docker image (#45466)
Improvements
Python API
- Use tree-based sum for floats to avoid numerical instability (#39516)
- nn.ReflectionPad: add support for 0-dim batch sizes (#39231)
- torch.scatter: add reductions for CPU (#36447)
- Allow any valid ASCII python identifiers as dimnames (#40871)
- Improve Python warning prints when there is also an error (#41116)
- torch.iinfo, torch.finfo: improve printing (#40488)
- torch.where: add support for scalar input (#40336)
- torch.nonzero: remove deprecation warning for as_tuple argument (#45413)
- torch.distributions.Categorical: clamp logit to avoid -inf when calculating entropy (#41002)
- torch.futures.Future: add done function to query the status of the future (#42013)
torch.nn
- nn.EmbeddingBag: add support for include_last_offset=True when reduction is mean or max (#42215)
- nn.AvgPooling{1,2,3}d: ensure all cells are valid in ceil mode to avoid division by 0 (#41368)
- nn.[Adaptive]MaxPool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
- nn.Hardsigmoid, nn.Hardswish: add inplace option (#42346)
- nn.MSELoss, nn.L1Loss, nn.SmoothL1Loss: add support for target that requires gradients (#44437, #44471, #44486)
- nn.Parameter{List,Dict}: add warning when improperly used (with DataParallel or weight_norm) (#44405)
- nn.functional.smooth_l1: add beta parameter (#44433)
Build
- Report an error when ATEN_THREADING is OMP and USE_OPENMP is turned off (#40146)
- Raise a nice error when trying to build PyTorch on a 32-bit Windows system (#40321)
- Make setup.py Python-2 syntactically correct and work for versions >= 3.9 (#41960, #46388)
- Don't proceed into setup.py too far if the Python version is unsupported (#42870)
Distributed
- Support profiling rpc_async in TorchScript (#40652)
- Allow RPC to be initialized again after shutdown (#42723)
- Support rpc_sync, rpc.remote in TorchScript (#43043, #43046)
- Make async_execution compatible with RRef helpers (#44666)
- Extend RPC profiling to support async function execution over RPC (#44664)
- Support record_shapes in RPC profiling (#44419)
- Add variants for cuda.comm.broadcast/gather/scatter which store the result in a provided "out" parameter (#39681)
- Explicitly abort NCCL Communicators on ProcessGroupNCCL destruction (#40585)
- Helper function to print out all DDP-relevant env vars (#41297)
- Add timeout to ProcessGroup Work wait (#40944)
- Support wait timeout in ProcessGroupNCCL (#40946)
- Support work-level timeouts in ProcessGroupGloo (#40948)
- Support for torch.bool in ProcessGroupNCCL (#41959)
- DDP.train() returns self to stay consistent with nn.Module (#42131)
- Add a drop_last option in DistributedSampler to drop the tail of the data to ensure data is even across ranks (#41171)
- Additional error checking for torch.cuda.nccl APIs (#43247)
- Support work.result() to get result tensors for allreduce for Gloo and NCCL backends (#43970)
- Add a device parameter to RemoteModule (#44254)
- Add remote_parameters() API for RemoteModule (#43906)
- Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
TorchScript
- Support string concatenation (cc29c19)
- Support using Python Enum in TorchScript (#41390, #41965, #42085, #42623, #42661, #42874, #43460, #43188, #44243, #44891)
- Support sorting lists of strings (#42398)
- Support boolean keys in dictionaries (#42833)
- Support @torch.no_grad (#41371)
- Support del on TorchScript classes (#44352)
- Speed up saving modules in the case of having many classes (#44589)
- Support the Python Slice class in TorchScript (#44335)
- Support sorting a list of tuples (#43448)
- Enable @torch.jit.unused syntax for ignoring properties (#45261)
- Enable ProfilingExecutor + TensorExpression (#45546)
- Support @torch.jit.unused on a @torch.no_grad decorated function (#41496)
- Improve ModuleList indexing error message (#43361)
- Better match the behavior of loaded ScriptModules vs. freshly created ones (#43298)
- Support backend-lowered submodules (#41146)
- Allow freezing of modules containing interface attributes (#41860)
- to_backend API now accepts wrapped modules (#43612)
- Allow submodule methods inference rules to be different (#43872)
- Support default values for arguments of class type methods (#45098)
- Improve sugared value's error message when closing over global variables (#42889)
- Support backend-lowered submodules (#40841)
- Turn on non-ASCII string literals serialization (#40719)
- Better printing of Tensor stride information (#45156)
Mobile
- Allow specifying the PYTHON executable to build_android (#41927)
- Include all overloads for OSS custom build (a01e91e)
Quantization
- Change the whitelist to allowlist (#41771, #41802)
- dequantize now supports lists and tuples of tensors (#41079)
- Users now have a way to add an activation post process hook using the register_activation_post_process_hook function (#42342)
- add/mul now support different variants (#42769)
- Fake quantizer now has more info when printed (#43031)
- OP_LIST_TO_FUSER_METHOD is exposed to the user (#43286)
- quantize_jit can handle new upsample overloads (#43407)
- Setter/getter methods for quantization and fusion mappings (#43990)
- fake_quant and observer can be disabled in scriptmodule (#44773)
- convert_jit can now take a preserved_attrs argument (#44490)
- SyncBN: preserve qconfig if it exists (#45317)
- Add quant APIs to save/load observer state_dict (#44846)
- Add version support for the conv parameters (#43524, #43086, #43651, #44671)
ONNX
In PyTorch 1.7, we have continued to add and improve PyTorch operator export to ONNX. We have enabled export of 10 new operators, and further enhanced and optimized export of 10+ torch operators to ONNX. We have also focused on improving export of TorchScript modules, in particular laying some groundwork required for better support in the near future. We have also created an API (torch.onnx.utils._find_missing_ops_onnx_export) as a diagnostic tool (preview only) to get a list of operators in a model that are not supported or implemented by the ONNX exporter. Support for export of torch.quantization.FakeQuantize has also been added to help enable some QAT workflows.
Add support to export more torch ops
torch.view_as (#40496), fake quantize functions (#39738), embedding_bag (#41234, #44693), torch.eye (#41357), Tensor.as_strided (#41569), torch.tensor (#41872), addition between lists of tensors (#41888), Tensor.__floordiv__ (#43022), torch.nn.KLDivLoss (#41858), Tensor.new_empty and Tensor.new_zeros (#43506)
Improves existing export logic and optimizes the exported ONNX graph
- Add warning in ONNX export when constant folding is on in training-amenable mode (#40546)
- Fix export of torch.full_like (#40063)
- Add pass that fuses Conv and BatchNormalization (#40547)
- torch.where export: add support for ByteTensor (#42264)
- Fix scalar type cast for comparison ops (#37787)
- torch.scatter export: add support for src being scalar or of different dtype (#42765, #43440)
- Fix Squeeze operator when applied to a dimension with shape > 1 (#38476)
- Extend support for torch.where (#41544)
- Update ops torch.slice (#42935), torch.split (#43670), torch.repeat (#43430), torch.arange (#43777), len (#43824), torch.narrow (#44039), flatten (#40418), adaptive_pool (#46100)

Update export to follow pytorch changes
- Update div export to perform true divide (#44831)
- Enable true_divide scripting export with ONNX shape inference (#43991)
Misc
- torch.utils.collect_env: collect more information (python 32/64-bit, clang version, CPU architecture, ROCm version) (#42887, #42961, #44106)
- torch.hub.load_local: allow loading models from any local directory (#44204)
- Add warning if import torch is called from the source root (#39995)
- Improve dynamic library loading for Windows (#40365)
- serialization: validate sparse tensors after loading (#34059)
- Add --continue-through-error option to run_test.sh script (#41136)
- Tensorboard: support custom run_name and hparam_domain_discrete in add_hparams (#40660, #40720)
- MKLDNN: enable conv3d, batchnorm3d, max_pool3d and avg_pool3d (#40691, #40995, #40996)
- Profiler: do not record zero duration kernel events (#41540)
- Profiler: improve cuda time counting (#45209)
- Profiler: add with_source parameter to enable tracking source code (#43898)
- Optim: add verbose param for all schedulers (#41580)
- Pruning: check attributes before deleting (#41913)
- Autograd: in zero_grad, avoid using inplace detach when it is not required (#41283)
- Autograd: update the torch.div backward formula to improve numerical stability (#43627)
- Autograd: print all tracebacks for higher order backwards in detect_anomaly (#43626)
- Autograd: stop saving the input of torch.repeat, as only input.dim() is needed in backward (#40766)
- CUDA: improve cuDNN error messages to include call parameters (#45023)
- CUDA: improve device_count and cuda init error detection and messages (#42249)
- Improve Tensor layout propagation for pointwise ops to follow input layout more closely (#42922)
- Remove blacklist/whitelist references (#41447, #41644, #41636, #41777, #41822, #41691, #41789, #41979, #41627, #42011, #41796, #42067, #42091, #42097, #42071, #42089, #42279, #42047, #42088, #45260)
Python Type Annotations
- Update some types in top level torch/*.py (#40235, #40873)
- Added typing for Tensor attributes and methods: T and grad_fn (#40879), Tensor._version (#41125), ndim (#42909), nonzero (#43053, #40499)
- Added typing for torch.serialization (#40862)
- Added typing for torch.tensor (#45077)
- Added typing for torch.Size (#40879)
- Added typing for torch.futures (#41675)
- Added typing for torch.random (#42234)
- Added typing for torch.hub (#42252)
- Added typing for collect_env.py (#43062)
- Added typing for torch.utils (#39392, #42647, #42711, #42960, #43806, #44136, #44216)
- Added typing for torch.nn (#43044, #44093, #43080, #42231, #40669)
- Added typing for torch.sparse (#43108)
- Added typing for torch.cuda.nvtx (#43443)
- Added typing for torch.cuda.memory (#43444)
- Added typing for torch.functional (#43446)
- Added typing for torch.autograd (#44451, #46206)
- Added typing for torch.quantization.fuse_modules (#43786)
- Added typing for torch.nn.quantized (#43186, #44154, #43110)
- Added typing for torch.testing._internal submodules (#44575, #44805, #44832, #44911, #44927, #44985, #44971, #45107, #45368, #45375)
- Added typing for torch.backends.quantized (#44794)
- Added typing for torch.backends.cuda (#44916)
- Added typing for torch.cuda.{comm,nccl,amp} (#45350, #45344, #45480)
- Added typing for torch.quasirandom (#45434)
- Fix typing for jit.trace and onnx.export (#41093)
- Fix typing for torch/optim/lr_scheduler.pyi (#41775, #41866)
Bug fixes
Python API
- torch.linspace: fix step computation for large integral types (#40132)
- torch.pca_lowrank: fix unexpected memory consumption (#40853)
- torch.linspace: fix behavior for non-contiguous inputs on CPU (#41286)
- torch.div: fix division by a low precision scalar (#41446)
- torch.expm1: disable mkl as it produces wrong values in some cases (#41654)
- torch.utils.data.RandomSampler: stop generating samples one at a time when replacement=True (#41682)
- torch.nn.functional.grid_sample: fix 64-bit indexing (#41923)
- torch.nn.functional.grid_sample: fix crash when grid has NaNs (#42703)
- torch.det: fix on CPU (#35136)
- torch.interpolate: avoid zero division in cubic mode (#42093)
- torch.fmod: fix to work with zero divisors consistently (#41948)
- torch.masked_select: fix for discontiguous outputs (#41841)
- torch.cummin, torch.cummax: fix for discontiguous inputs/outputs (#42507)
- torch.einsum: fix for discontiguous inputs (#42425)
- torch.orgqr: fix input size conditions (#42825)
- torch.manual_seed: fix argument unpacking (#42206)
- torch.searchsorted: properly mark output as non differentiable (#42933)
- torch.bucketize: properly mark output as non differentiable (#44102)
- torch.addmm: properly raise error on device mismatch (#43505)
- torch.chain_matmul: properly handle empty args (#43553)
- torch.multinomial: properly handle 0 size dim (#43775)
- torch.cholesky_solve: fix broadcast and error checking (#43137)
- torch.movedim: fix uniqueness check (#44307)
- torch.min, torch.max, torch.mean: properly throw error if dim is repeated (#44281)
- torch.lerp: fix for discontiguous outputs on CUDA (#44559)
- torch.addmv, torch.mv: fix beta=0 case in slow path (#44681)
- torch.triangular_solve: fix error check on CPU (#44720)
- torch.empty_like, torch.zeros_like: properly raise error if any memory format is provided with sparse input (#44058)
- torch.atan2: fix type promotion (#43466)
- torch.repeat: fix backward for 0 size repeats (#45212)
- torch.min, torch.max, torch.median: fix handling of nan in backward (#45280)
- torch.rdiv: properly make it consistent with div (#45407)
- torch.std: fix handling of nan in backward (#45468)
- torch.distributions.Binomial: fix CUDA sampling at extreme points (#42702)
- torch.dot, torch.vdot: add complex support (#45074)
- torch.pow: fix when scalar base is complex (#45259)
- torch.round, torch.abs_: disable complex inputs (#45330)
- torch.svd: fix memory corruption for complex inputs (#45486)
- torch.view_as_complex: fix zero dimensional input (#44175)
- torch.kthvalue: fix for non-contiguous input (#46177)
- torch.save: fix python binding that could lead to out of bound read (#46207)
torch.nn
- nn.ModuleDict: fix input dict key ordering (#40905)
- nn.LayerNorm: fix handling of gamma in the backward when create_graph=True (#41595)
- nn.functional.{max,avg}_pool{1,2,3}d: raise RuntimeError for zero stride (#41819)
- nn.Module: fix missing attribute when loading model from older version (#42290)
- nn.Embedding: raise proper error for 0-D weight (#42550)
- nn.SyncBatchNorm: fix forward pass for non-default process group (#43861)
- nn.functional.embedding_bag: fix for non-contiguous weight (#44032)
- nn.functional.upsample: add nondeterministic checks (df6ea62)
- nn.GroupNorm: fix bug when input does not require_grad on CUDA (#44863)
- functional.{l1_loss,smoothl1_loss,mse_loss}: properly check that reduction strings are valid (#43527)
- functional.smoothl1_loss: properly raise error for negative beta values (#45759)
- functional.pad: fix extra memory allocation and invalid result for negative or zero pad when using circular padding (#39273)
C++ API
- nn::MultiheadAttention: ensure all parameters are properly registered (#42037)
- Tensor::grad: fix thread safety issues (#40887)
- Tensor::var: ensure that var(0) does not call the var(bool keepdim) overload but var(int dim) (#40451)
Distributed
- Fix RPC and ProcessGroup GIL deadlock (#45088)
- Relax size check in flatten_for_scatter_gather (#40573)
- BAND, BOR and BXOR for NCCL all_reduce should throw runtime errors (#42669)
- Disallow creation of ProcessGroupNCCL without GPUs (#45642)
- Fix read/write of bulk data (#42504)
- Fix thread safety issue with distributed optimizers and TorchScript (#46071)
TorchScript
- Fix type annotations in select assignments (#40528)
- Fix compilation issues with GCC-5.4 (#41055, #41063, #43223)
- Fix JIT not rounding to even if constant is folded (#40897)
- Fix torch.jit.freeze import (#42319)
- Fix List[str].index (#40348)
- Fix torch.jit.is_tracing() so that it is correctly called rather than returning the method itself (#42486)
- Fix Str -> Device implicit conversions (#43213)
- Fix NaN propagation in fuser's min/max implementation (#43590)
- Cast return values of functions returning Any (#42259)
- Fix NaN propagation in TensorExpression fuser's min/max implementation (#43609)
- Fix segfault in attribute lookup on loaded ScriptModules (#43284)
- Fix casting of unsigned char, and abs(int) (#44157)
- Fix frac in CUDA fuser (#44152)
- Fix model_name not logged properly issue (#45488)
- Fix len, contains, getitem inherited from interface class derived from nn container (#40789)
- Fix support for FP16 in CudaCodegen (#44209)
- Fix torch.tensor for empty multidimensional-typed lists (#44652)
- Fix freeze_module pass for sharedtype (#42457)
- Correctly clone schema in insert_observers (#40624)
- Fix value association with dictionaries in the tracer (#40885)
- Fix preserve submodule attribute in freezing (#45143)
- Fix Half conversion of immediates in NNC CUDA backend (#45213)
- Fix a bug in SplitWithMask when splitting multiple times (#45141)
- Fix inlining interface call in fork subgraph (#43790)
- Fix operator order in combineMultilane in TensorExpr fuser (#45157)
- Correctly mark Tensor types inferred from empty annotation as inferred=True (#45360)
- Fix some bugs in Round+Mod simplification in NNC (#42934)
- Fix set_grad_enabled scripted version (#46060)
- Fix for dict.update() scripted version (#46105)
- Fix segfault when scripting nested classes (#46422)
- Fix memory leak in Profiling Mode (#46621)
Quantization

- Resolved namespace conflict in qnnpack for init_win symbol (a7e09b8)
- Fix linking of qnnpack params on Windows (#40920)
- Add zero point type check for per-channel quantization (#40811)
- Remove activation_post_process in QAT modules (#42343) (#43015)
- `qlinear_dynamic`: Fix ASAN error in QNNPACK's integration (#41967)
- Change quantizer to account for the input tensor's memory format (#42178)
- Fix the output shape for the linear (#44513)
- Ensure observers and fq modules are scriptable (#44749)
- histogram observer: ensure buffer shape consistency (#44956)
- Attach qconfig to all modules (#42576)
- Fix qnnpack quantized activations for NHWC memory format (#46217)
ONNX

- Fix crash when exporting a model with `nn.Sequential` (#19227)
- Fix default `ignore_index` for nll loss (#44816)
- Rename Black to Block for various files (#42913)
- Fix bug in `onnx::SsaRewrite` (#42148)
Misc

- Fix `torch.hub` for new zipfile format (#42333)
- Preserve Python backtrace in autograd engine errors (#43684)
- `optim.SparseAdam`: Fix check that params are dense on init (#43668)
- Fix clang build (#44934)
- `nn::MultiheadAttention`: Fix parameter registration (#42037)
- MaxPool2D: Fix memory leak for XNNPACK (#41874)
- Fix NumPy scalar detection for bool and complex types (#43644)
- Add missing file to `BUILD.bazel` (#40536)
- `autograd.gradcheck`: Add support for complex (#43208)
- Fix bug in mobile-specific CPU caching allocator (#43719)
Performance

Python API

- `torch.{view_as_complex,view_as_real}`: Remove unnecessary temporary Tensor (#44908)
- `tensorboard.SummaryWriter.add_audio`: Remove unnecessary for loops (#44201)
- `Conv2d` and `Conv3d`: bypass the im2col for 1x1 conv (#40324)
- Fix `max_pool2d` perf regression (#41174)
- Disable mkldnn for `conv2d` in some special cases (#40610)
- `addmm`: Reduce constant time overhead (#41374)
- `cumsum`, `cumprod`: Enable non-synchronizing cub scan for cum* operations (#42036)
- `max_pool2d`: CUDA NCHW performance improvement (#42182)
- `arange`: Vectorize CPU implementation (#38697)
- `istft`: optimize by using col2im (#42826)
- `LayerNorm`: improved performance on CPU for both forward and backward (#35750)
- `silu`: improved performance (#42976)
- `addmv`: improved performance for zero-sized input cases (#41824)
- Mobile: Simple caching allocator for CPU (#42006)
- `MaxPool1d`: improved performance for cases without indices (#43745)
- `adaptive_avg_pool2d`: optimized code path for cases when the output size is (1, 1) (#44211)
- Vectorized complex copy (#44722)
- `cat`: optimized CUDA kernel (#44833)
- Vectorized int8_t on CPU (#44759)
- Vectorized `bitwise_not` (#45103)
- Added stateful XNNPack deconvolution2d operator to torch (#43233)
- Enabled mkldnn dilation convolution (#40483)
Distributed

- Skip allreducing `local_used_maps_dev_` when `find_unused_param=False` in DDP to improve performance (#40407)
- Remove unnecessary copies in ProcessGroupGloo for multiple-input allreduce (#43543)
- Add option to run NCCL operations on a high-priority CUDA stream (#43796)
- Enhance DistributedOptimizer to be functional and torchscriptable to avoid GIL and global lock (#45221)
TorchScript

- JIT pass for add-relu fusion (#39343)
- Optimize autodiff subgraph slicing (#41437)
- Don't re-run CSE on every block (#41479)
- Add loop unroll optimization in NNC (#42465)
- Speed up CUDA kernel launch when block/thread extents are statically known (#42899)
- Support merging adjacent fusion groups in TensorExpression Fuser (#43671)
- Add passes to profiling executor pipeline (#43636)
- Improve performance of `KernelSumMultipleAxes` (#43905)
- Latency improvements for pointwise + reduction fusion (#45218)
- Add simplification of Loop + Condition patterns in NNC (#44764)
- Fix fallback graph in specialize autogradzero (#44654)
- Fix masking for all block and thread dimensions in CudaCodeGen (#44733)
- Improve performance of simple reduction and softmax in nvFuser (#40864)
- Add a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write (#42606)
- Fuse identical conditions in NNC simplifier (#44886)
- Add _out variants and reuse memory in static runtime (#44128)
Mobile

- Add add_relu fusion pass to optimize_for_mobile (#40252)
- optimize_for_mobile: bring packed params to root module (#42740)
- Apply selective build on RNN operators (#44132)
- Add NEON backend for vectorization (#39341)
Quantization

- Use the _min_max function instead of two separate calls for min and max (#41570, #42957, #44537)
- Improve performance of the QNNPACK kernels (#41342, #42007, #42008)
- Speed up HistogramObserver by vectorizing the critical path (#41041)
- Speed up AdaptivePool3d by checking if input is ChannelsLast or ChannelsLast3d (#42780)
- observers: use clamp instead of min/max in calculate_qparams (#43150)
- observers: use torch.all to check for valid min and max values (#43151)
- Avoid resizing in MinMaxObserver (#43789)
- observers: make eps a buffer (#43149)
Misc

- ROCm: Fix performance issues with `torch.cat` (#46323)
Documentation

Python API

- Numerous typo and grammatical improvements (#39854, #40217, #40285, #40544, #40692, #40617, #41025, #41031, #40984, #41066, #41203, #41263, #41384, #41526, #41563, #41632, #41643, #41599, #41799, #41679, #41835, #41851, #41963, #42016, #42076, #41946, #42046, #42065, #42236, #42184, #42734, #42923, #42891, #43063, #43131, #43395, #43588, #43583, #43697, #43779, #43569, #43893, #43695, #43973, #44667, #44753, #44740, #45045, #45192, #43308, #40334)
- Remove use of the term "blacklist" (#41450)
- Add overflow notice for cuFFT on half precision (#40551)
- Add complex note (#41012, #41252, #40450)
- Add documentation about data sharing for Tensors during serialization (#40412)
- Add `nn.Module.training` to docs (#40923)
- `nn.CrossEntropyLoss`: Clarify that the mean argument is weighted (#40991)
- `torch.scatter_`: Update doc with support for reduction methods (#40962)
- Fix HTTP links in documentation to HTTPS (#40878)
- Fix warnings when building docs (#41068, #41334, #41335, #44686)
- Add PyTorch Glossary (#40639)
- Fix documentation references following page split (#39086)
- Update serialization note to explain versioned symbols and dynamic versioning (#41395)
- Make elementwise comparison docs more consistent (#41626)
- Update CONTRIBUTING.md to explain how to use ccache (#41619)
- Add doc warning for LSTM non-deterministic behavior (#40893)
- Document default dim for cross being None (#41850)
- Clarify that Python 3.6 is the minimum supported version in the installation section (#41937)
- Split quantization subsection into smaller pages (#41321)
- Documentation for `torch.optim.swa_utils` (#41228)
- Improve the documentation of DistributedDataParallel (#42471)
- Update docs about CUDA stream priority (#41364)
- Update the documentation for `torch.scatter` to include the streams parameter (#42814)
- Update `Tensor.clone` doc (#42931, #43098)
- Update external links in the README.md (#43100)
- Update `torch.Tensor.is_set_to` documentation (#43052)
- Polish the nightly pull docs in CONTRIBUTING (#43494)
- Update the `torch.qr` documentation to include a warning about when QR.backward is well-defined (#43547)
- Update the instructions to build from source on Windows (#43479, #45553)
- Document the beta=0 behavior of BLAS functions (#43823)
- Fix docs for kwargs-only functions (#43586, #43589)
- Document `torch.sub` properly, add `torch.subtract` alias (#43850)
- Update determinism documentation (#41692)
- Update instructions to build (#42850)
- Clarify `nn.Batchnorm` `track_running_stats` docs (#44445)
- Fix LaTeX error in `torch.heaviside` docs (#44481)
- Update `torch.median` doc to explain the returned value for even-sized input (#44562)
- Fix the `nn.ELU` formula in the docs (#43764)
- `torch.min`, `torch.max`: remove incorrect warning from docs (#44615)
- Reference the `torch.cuda.amp` tutorial from the core amp docs (#44725)
- Mention TF32 in related docs (#44690)
- Clarify that 5-D 'bilinear' grid_sample is actually trilinear (#45090)
- Update linalg warning + docs (#45415)
- Update `torch.floor_divide` documentation to clarify that it actually performs truncation division (#45411)
- Update `torch.fft` doc and make the warning clearer (#45409)
- Updates for complex autograd (#45270, #46281)
- Update `nn.Flatten` docs (#42084)
Distributed

- Add a CONTRIBUTING.md for the distributed package (#44224)
- Added docs for Store API (#45543)
- Add `all_gather_object` and `gather_object` documentation (#43772)
TorchScript

- Fix `torch.jit.trace_module` documentation (#40248)
- Fix the docs for the inputs arg of `torch.jit.trace_module` (#41586)
- Add documentation for `PYTORCH_JIT_TYPE_VERBOSITY` (#42241)
- Grammatical corrections in JIT overview (#43473)
- Update docs for recently added JIT features, including Enum support, `torch.no_grad`, etc. (#45232)
- Add function signature for `pixel_shuffle` (#45661)
- Fix signature for `torch.poisson` in documentation (#45656)
Mobile

- Aar native linking: add fbjni (#40578)
- Fix scripts (#44464)
- [PyTorch Mobile] Move some string ops to register_prim_ops.cpp and make them selective (#44500)
Quantization

- Fix several quantization documentation typos (#40567, #43693)
- API summary section (#45848)
- Documentation for dynamically quantized RNN cells (#40896)
Misc

- Update ONNX docs for release (#45086)
- v1.7.0-rc4
  October 23, 2020
- v1.7.0-rc3
  October 21, 2020
- v1.7.0-rc2
  October 15, 2020
- v1.7.0-rc1
  September 30, 2020
- v1.6.0 Changes
  July 28, 2020

PyTorch 1.6.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
Highlights

The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.

A few of the highlights include:

1. Automatic mixed precision (AMP) training is now natively supported and a stable feature - thanks to NVIDIA's contributions;
2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
3. New profiling tools providing tensor-level memory consumption information; and
4. Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.

Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here.
[Stable] Automatic Mixed Precision (AMP) Training

AMP allows users to easily enable automatic mixed precision training, enabling higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported `torch.cuda.amp` API, AMP provides convenience methods for mixed precision, where some operations use the `torch.float32` (`float`) datatype and other operations use `torch.float16` (`half`). Some ops, like linear layers and convolutions, are much faster in `float16`. Other ops, like reductions, often require the dynamic range of `float32`. Mixed precision tries to match each op to its appropriate datatype.
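As a minimal sketch of the typical pattern, combining `torch.cuda.amp.autocast` for the forward pass with `torch.cuda.amp.GradScaler` for loss scaling (the model, optimizer, and `data` iterable below are illustrative placeholders, not part of these notes):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
data = [(torch.randn(4, 10, device='cuda'), torch.randn(4, 10, device='cuda'))
        for _ in range(3)]  # placeholder dataset

for input, target in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in float16/float32 as appropriate
        output = model(input)
        loss = torch.nn.functional.mse_loss(output, target)
    scaler.scale(loss).backward()    # scale the loss to avoid float16 underflow
    scaler.step(optimizer)           # unscales gradients, then runs optimizer.step()
    scaler.update()
```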
[Beta] TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.
```python
# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)

# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)
```
[Beta] Memory Profiler

The `torch.autograd.profiler` API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.

Here is an example usage of the API:
```python
import torch
import torchvision.models as models
import torch.autograd.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# ---------------------------  ---------------  ---------------  ---------------
# Name                         CPU Mem          Self CPU Mem     Number of Calls
# ---------------------------  ---------------  ---------------  ---------------
# empty                        94.79 Mb         94.79 Mb         123
# resize_                      11.48 Mb         11.48 Mb         2
# addmm                        19.53 Kb         19.53 Kb         1
# empty_strided                4 b              4 b              1
# conv2d                       47.37 Mb         0 b              20
# ---------------------------  ---------------  ---------------  ---------------
```
Distributed and RPC Features and Improvements

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Previously, these two features worked independently and users couldn't mix and match them to try out hybrid parallelism paradigms.

Starting with PyTorch 1.6, we've enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.
```python
# On each trainer
remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)

for data in batch:
    with torch.distributed.autograd.context():
        res = remote_emb(data)
        loss = ddp_model(res)
        torch.distributed.autograd.backward([loss])
```
[Beta] RPC - Asynchronous User Functions

RPC Asynchronous User Functions supports the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processed a request, one RPC thread waited until the user function returned. If the user function contained IO (e.g., nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications had to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the `@rpc.functions.async_execution` decorator; and 2) Let the function return a `torch.futures.Future` and install the resume logic as callbacks on the `Future` object. See below for an example:

```python
@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1",
    async_add_chained,
    args=("worker2", torch.ones(2), 1, 1)
)
print(ret)  # prints tensor([3., 3.])
```
- Tutorial for performant batch RPC using Asynchronous User Functions | Link
- Documentation | Link
- Usage examples | Link
[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and allows the ability to unlock the computational power of parallel architectures (e.g. many-core CPUs) for task-level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: `torch.jit.fork` and `torch.jit.wait`. In the below example, we parallelize execution of `foo`:

```python
import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))
```
- Documentation | Link
Backwards Incompatible Changes

Dropped support for Python <= 3.5 (#39879)

The minimum version of Python we now support is 3.6. Please upgrade your Python to match. If you use conda, instructions for setting up a new environment with Python >= 3.6 can be found here.
Throw a RuntimeError for deprecated `torch.div` and `torch.addcdiv` integer floor division behavior (#38762, #38620)

In 1.5.1 and older PyTorch releases, `torch.div`, `torch.addcdiv`, and the `/` operator perform integer floor division. In 1.6 attempting to perform integer division throws a RuntimeError, and in 1.7 the behavior will change so that these operations always perform true division (consistent with Python and NumPy division). To floor divide integer tensors, please use `torch.floor_divide` instead.

Version 1.5.1:

```python
>>> torch.tensor(3) / torch.tensor(2)
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division
of tensors using div or / is deprecated, and in a future release div
will perform true division as in Python 3. Use true_divide or
floor_divide (// in Python) instead.
tensor(1)
```

Version 1.6.0:

```python
>>> # NB: the following is equivalent to
>>> # torch.floor_divide(torch.tensor(3), torch.tensor(2))
>>> torch.tensor(3) // torch.tensor(2)
tensor(1)
```
The fix for `torch.addcdiv` is similar.

Version 1.5.1:

```python
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> torch.addcdiv(input, tensor, other, value=value)
../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning: Integer
division with addcdiv is deprecated, and in a future release addcdiv
will perform a true division of tensor1 and tensor2. The current
addcdiv behavior can be replicated using floor_divide for integral
inputs (self + value * tensor1 // tensor2) and division for float
inputs (self + value * tensor1 / tensor2). The new addcdiv behavior
can be implemented with true_divide
(self + value * torch.true_divide(tensor1, tensor2)).
tensor(0)
```

Version 1.6.0:

```python
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> (input + torch.floor_divide(value * tensor, other))
tensor(0)
```
Prevent cross-device data movement for zero-dimensional CUDA tensors in binary pointwise PyTorch operators (#38998)

In previous versions of PyTorch, zero-dimensional CUDA tensors could be moved across devices implicitly while performing binary pointwise operations (e.g. addition, subtraction, multiplication, division, and others). For example,

```python
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```

would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.

To perform binary pointwise operations on data of different devices, please cast the tensors to the correct device by using `Tensor.to`:

Version 1.5.1:

```python
>>> torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
tensor([6, 6], device='cuda:1')
```

Version 1.6.0:

```python
>>> torch.tensor(5, device='cuda:0').to('cuda:1') + torch.tensor((1, 1), device='cuda:1')
tensor([6, 6], device='cuda:1')
```
Dropped support for CUDA 9.2 on Windows

In previous versions of PyTorch, we provided an installation option for Windows environments running CUDA 9.2. Starting from PyTorch 1.6.0, we are no longer providing those binaries. Please upgrade your CUDA version to 10.1 or 10.2 and install a PyTorch binary for one of those CUDA versions instead.

PyTorch release binaries dropped dedicated bytecode for CUDA compute capability 6.1; removed PTX for CUDA compute capability 3.7

To check whether you are affected, please find your GPU in the table in this link.

If you are using an Nvidia GPU with compute capability 6.1, you may notice a performance hit when using the release binaries (installed via pip or conda). We stopped building for CUDA compute capability 6.1 but PyTorch programs should still continue to work with those devices. If you do notice a performance hit, a workaround is to compile PyTorch from source.

If you are using an Nvidia GPU with compute capability 3.7 and relied on PTX, we have dropped support for that in our release binaries (installed via pip or conda). Potential workarounds are: install a previous version of PyTorch, or compile PyTorch from source.
Changed how bool tensors are constructed from non-bool values to match Python, C++, and NumPy (#38392)

In previous versions of PyTorch, when a bool tensor was constructed from a floating-point tensor, we would first convert the tensor to a long tensor, then to a bool tensor. This is not consistent with how bools are interpreted in Python, C++, and NumPy (just to name a few), which interpret 0 floating-point values as False and everything else as True.

If you were relying on the previous behavior, the following code will achieve the same effect.

Version 1.5.1:

```python
>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2], dtype=torch.bool)
tensor([True, True, False, False, False, True, True])
```

Version 1.6.0:

```python
>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2]).long().bool()
tensor([True, True, False, False, False, True, True])
```
Throw RuntimeError when torch.full would infer a float dtype from a bool or integral fill value (#40364)

In PyTorch 1.6, bool and integral fill values given to `torch.full` must set the `dtype` or `out` keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for `torch.full` has been updated to reflect this.
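For illustration, a small hedged sketch of the 1.6 behavior:

```python
import torch

# Raises a RuntimeError in 1.6: the result dtype would have to be
# inferred from the integral fill value.
# torch.full((2, 2), 7)

# OK: the dtype is stated explicitly.
torch.full((2, 2), 7, dtype=torch.long)   # tensor([[7, 7], [7, 7]])
torch.full((2,), True, dtype=torch.bool)  # tensor([True, True])
```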
Enabled thread parallelism for autograd on CPU (#33157)

In previous versions of PyTorch, running `.backward()` in multiple threads caused them to be serialized in a specific order, resulting in no parallelism on CPU. In PyTorch 1.6.0, running `.backward()` in multiple threads no longer serializes the execution and instead autograd will run them in parallel.

This is BC-breaking for the following two use cases:

- If any weights are shared among threads, gradient accumulation that was previously deterministic may become non-deterministic in 1.6 as two different threads will write to the .grad attribute in a non-deterministic order.
- If you use any C++ hooks, those are not guaranteed to be thread-safe. Please change them to be thread-safe.

In more detail, in 1.6.0, when you run `backward()` or `grad()` via Python, TorchScript or the C++ API in multiple threads on CPU, you should expect to see extra concurrency. For example, you can manually write multithreaded Hogwild training code like:

```python
import threading
import torch

# Define a train function to be used in different threads
def train_fn(model, input):
    # forward
    y = model(input)
    # backward
    y.sum().backward()
    # potential optimizer update

# define your model in python or in TorchScript
model = Model()

# Users write their own threading code to drive the train_fn
threads = []
for _ in range(10):
    # define or load the data
    input = torch.ones(5, 5, requires_grad=True)
    p = threading.Thread(target=train_fn, args=(model, input))
    p.start()
    threads.append(p)

for p in threads:
    p.join()
```
Note that when you use the same model and call `backward()` concurrently in multiple threads, model parameters are automatically shared across threads. The gradient accumulation might become non-deterministic as two backward calls might access and try to accumulate the same .grad attribute. Although we do proper locking to avoid data corruption, we don't guarantee the order in which the ops are executed, so non-determinism might arise, but this is an expected pattern in multithread training. You could use the functional API `torch.autograd.grad()` to calculate the gradients instead of `backward()` to avoid the non-determinism.
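For example, a minimal sketch of that functional alternative (the tiny model here is purely illustrative):

```python
import torch

model = torch.nn.Linear(5, 1)
input = torch.ones(2, 5)
loss = model(input).sum()

# Returns gradients as a tuple instead of accumulating into .grad,
# sidestepping cross-thread accumulation order effects.
grads = torch.autograd.grad(loss, model.parameters())
```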
For thread safety:

- The custom Python/C++ Autograd Functions (both forward and backward) are properly protected and are guaranteed to be thread-safe in 1.6.0.
- For hooks, both Python/C++ hooks will run concurrently. Note that in C++, just like in regular C++ threading, you will need to do proper locking when writing shared objects, so previous custom C++ hooks might not work nicely under a multithreaded environment in 1.6.0. In Python, just like in regular python threading, you can read/write objects safely but the order (and thus determinism) is not guaranteed.
Change autograd gradient accumulation logic to yield `.grad`s that match the weights' memory layout (#40358)

In previous versions of PyTorch, autograd would yield contiguous gradients. Now, gradients have the same memory layout as their respective weights. This should result in silent performance improvements. Since PyTorch operators generally support non-contiguous tensors, this should have no functional effect on most PyTorch programs. A known exception is when accessing `param.grad` and performing an operation that requires a contiguous tensor, such as `param.grad.view(-1)`. In this case, you will receive an error as follows: `RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead`.

If a user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or manually set the grad to have the desired strides (`param.grad = param.grad.contiguous(desired format)`).

See the below section on "Note: BC-breaking memory format changes" for more details.
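As a hedged sketch of the presetting approach described above (the 4-D parameter and channels-last format are purely illustrative):

```python
import torch

param = torch.randn(8, 3, 4, 4, requires_grad=True)

# Preset .grad to a zeroed tensor in the desired layout; in-place
# accumulation during backward should then keep this layout.
param.grad = torch.zeros_like(param, memory_format=torch.channels_last)

param.sum().backward()
print(param.grad.is_contiguous(memory_format=torch.channels_last))  # True
```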
Change memory format promotion rules of pointwise operators (#37968)

In previous versions of PyTorch, performing a binary pointwise operation between a Contiguous and a Channels Last tensor produced a Channels Last tensor. In PyTorch 1.6, this now returns a tensor with the memory format of the first operand.

See the below section on "Note: BC-breaking memory format changes" for more details.
Note: BC-breaking memory format changes

Operations that now return tensors in a different memory format generally should have no functional effect on most PyTorch programs because PyTorch operators generally support non-contiguous tensors.

The most common incompatibility with Python programs is with the `view` operator, which has specific stride requirements. If these requirements are no longer met as a result of this change, you will get an error message indicating that you should use `reshape` instead, i.e. "RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."

Another possible incompatibility is if you have a (usually) C++ operator implementation that works directly on memory (i.e. calls data_ptr and relies on the strides being contiguous).
`nn.functional.interpolate`: `recompute_scale_factor` default behavior changed from `True` to `False` (#39453)

In PyTorch 1.5.1 and older versions, `nn.functional.interpolate(input, size, scale_factor, ..., recompute_scale_factor)` has a default of `recompute_scale_factor = True`. In PyTorch 1.6, we've changed the default to `recompute_scale_factor = False`.

Depending on the precision of the `scale_factor`, this may result in an output tensor with different values than before. To retain the old behavior, simply change your code to use `recompute_scale_factor = True`.

More concretely, what `recompute_scale_factor = True` means is, if the user passes in a `scale_factor`:

- We will first compute the new output size; and
- Then, we will compute a new `scale_factor` by dividing the output size by the input size and sending it to an internal helper function.
- The new `scale_factor` is used in the interpolate computation but in some cases is different from the `scale_factor` the user passed in.

This behavior resulted in loss of precision, so we deprecated it in PyTorch 1.5.0. In PyTorch 1.6 and onward, `recompute_scale_factor` has a default of `False`, which means that we pass the user's scale factor directly to an internal helper function.
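A small sketch of both behaviors (input shape and mode here are arbitrary):

```python
import torch
import torch.nn.functional as F

input = torch.randn(1, 1, 4, 4)

# 1.6 default: the scale_factor you pass is used directly.
out_new = F.interpolate(input, scale_factor=2.0, mode='bilinear',
                        align_corners=False)

# Opt back into the 1.5 behavior of recomputing the scale factor
# from the computed output size.
out_old = F.interpolate(input, scale_factor=2.0, mode='bilinear',
                        align_corners=False, recompute_scale_factor=True)
```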
`out=` arguments of pointwise and reduction functions no longer participate in type promotion (#39655)

In PyTorch 1.5, passing the `out=` kwarg to some functions, like `torch.add`, could affect the computation. That is,

```python
out = torch.add(a, b)
```

could produce a different result than

```python
torch.add(a, b, out=out)
```

This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
Changed `torch.quasirandom.SobolEngine(..., scramble=True, seed=None)` to respect `torch.manual_seed` when a seed has not been provided (#36427)

In previous versions of PyTorch, `SobolEngine(..., scramble=True, seed=None)` did not respect any calls to `torch.manual_seed`. The expected behavior for random number generation functions is to respect the seed set by `torch.manual_seed`, so we've changed `SobolEngine` to match.

If you were relying on the old behavior where `SobolEngine` ignores `torch.manual_seed`, please explicitly pass a different seed to `SobolEngine`:

Version 1.5.1:

```python
>>> torch.manual_seed(1337)
>>> # SobolEngine ignores the manual_seed and instead uses its own.
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=None).draw(3)
```

Version 1.6.0:

```python
>>> import time
>>> torch.manual_seed(1337)
>>> # To replicate the old behavior, pass a seed to SobolEngine.
>>> ms_since_epoch = int(round(time.time() * 1000))
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=ms_since_epoch).draw(3)
```
`Tensor.random_(from, to)`: Enforce the check that `from` and `to` are within the bounds of the Tensor's `dtype` (#37507)

In previous versions of PyTorch, `to` and `from` did not have to be within the bounds of the tensor's `dtype` (this raised a warning). The behavior of `random_` in that case can be unexpected. We are making this a hard error starting from PyTorch 1.6.0; please modify your code if you run into the error.

Version 1.5.1:

```python
>>> tensor = torch.zeros(10, dtype=torch.uint8)
>>> # 256 is the maximum value for `to` for torch.uint8
>>> tensor.random_(0, 257)
UserWarning: to - 1 is out of bounds for unsigned char.
```

Version 1.6.0:

```python
>>> tensor = torch.zeros(10, dtype=torch.uint8)
>>> # 256 is the maximum value for `to` for torch.uint8
>>> tensor.random_(0, 256)
```
Dropped support for CUDA < 9.2 for source builds (#38977, #36846)

If you build PyTorch from source, we've dropped support for using CUDA < 9.2 (run `nvcc --version` to check your CUDA version). Users who install PyTorch packages via conda and/or pip are unaffected.

`DataLoader`'s `__len__` changed to return the number of batches when holding an `IterableDataset` (#38925)

In previous versions of PyTorch, `len(<instance of dataloader holding an IterableDataset>)` would return the number of examples in the dataset. We've changed it to be the number of batches (e.g., the number of examples divided by the DataLoader's `batch_size`) to be consistent with the computation of length when the DataLoader has a BatchedSampler.
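A hedged sketch of the new length computation (`RangeDataset` below is a hypothetical example dataset; it must define `__len__` for `len(loader)` to work at all):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class RangeDataset(IterableDataset):
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return iter(range(self.n))
    def __len__(self):
        return self.n

loader = DataLoader(RangeDataset(10), batch_size=4)
print(len(loader))  # 1.6: 3 (number of batches); previously: 10 (number of examples)
```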
`torch.backends.cudnn.flags`: deleted unused `verbose` flag (#39228)

The `verbose` flag did nothing, so we deleted it. If you were passing a value to `flags` for `verbose`, please remove it.
RpcBackendOptions
takesfloat
instead oftimedelta
fortimeout
argument to stay consistent withtimeout
types in other TorchScriptable RPC APIs.# v1.5rpc.init\_rpc( "worker1", rank=0, world\_size=2, rpc\_backend\_options=rpc.ProcessGroupRpcBackendOptions( num\_send\_recv\_threads=16, datetime.timedelta(seconds=20) ) ) # v1.6rpc.init\_rpc( "worker1", rank=0, world\_size=2, rpc\_backend\_options=rpc.ProcessGroupRpcBackendOptions( num\_send\_recv\_threads=16, 20 # seconds ) )
TorchScript
0οΈβ£ The Default Executor Is Rolled Back To Legacy (#41017)
π We rolled back to the old fuser and the legacy executor in this release in order to recover some reported performance regressions. In future releases we plan to reach the same or better performance with a new redesigned executor and fuser.
π In order to switch back to the executor used in the 1.5 release one could use the following API:
- in Python: call
torch._C._jit_set_profiling_executor(True)
before you call your model for the first time, - in C++: include
#include <torch/csrc/jit/runtime/graph_executor.h>
and setgetExecutorMode() = true
before you invoke your model for the first time.
β Added dynamic versioning (#40279)
Note: this isnβt actually BC-breaking but we are listing it here because it is BC-Improving.
π The PyTorch Team recommends saving and loading modules with the same version of PyTorch. Older versions of PyTorch may not support newer modules, and newer versions may have removed or modified older behavior. These changes are explicitly described in PyTorchβs release notes, and modules relying on functionality that has changed may need to be updated to continue working properly.
π In this release, the historic behavior of
torch.div
andtorch.full
is preserved for models saved viatorch.jit.save
in previous versions of PyTorch. Modules saved with the current version of PyTorch will use the latesttorch.div
andtorch.full
behavior. See the notes above for the BC changes to those operators.Internals
The following are a list of BC-breaking changes to some of PyTorchβs internal components.
Dispatcher C++ API has had some spring cleaning. This is still considered an βinternalβ API, but it is becoming more public facing as it stabilizes.
- π Renamed callUnboxed() to call() in Dispatcher, OperatorHandle, KernelFunction (#37999)
- π The TensorId suffix has been removed from most DispatchKey enum entries (#36240)
- β Removed ::callOp(); use Dispatcher::call instead (renamed in #37797, removed in #38351, #38742)
- β Removed
KernelFunction::makeFromUnboxedFunctorFactory
; use makeFromUnboxedFunctor directly instead (#35488) - π Renamed boxing/unboxing files and utilities in ATen/core/boxing (#35411)
autograd.gradcheck
andautograd.gradgradcheck
: Added a new default-true argumentcheck_undefined_grad
(#39400)Internally, in the autograd engine, we use a special undefined Tensor value to represent zero-filled gradients and expect backward functions and user-defined
torch.autograd.Function
s to gracefully handle those values. Whencheck_undefined_grad
is True (the default for PyTorch 1.6+),gradcheck/gradgradcheck
test that the operation in question supports undefined output gradients. This may cause a previously succeedinggradcheck
to fail.You can turn the check off by setting
check_undefined_grad
to False. As long as autograd does not error out due to an undefined gradient in your model, then everything should be fine.Version 1.5.1 Version 1.6.0 >>> torch.autograd.gradcheck(my_custom_function, inputs) True | >>> # To keep the previous behavior >>> torch.autograd.gradcheck(my_custom_function, inputs, check_undefined_grad=False) True |
[C++ API] Changed the TensorIterator API (#39803)
π TensorIterator is an implementation detail for writing kernels that is exposed in our C++ API. Weβve modified how developers interact with TensorIterator, please see the Pull Request for more details.
Removed
torch._min
andtorch._max
(#38440)torch._min
andtorch._max
are undocumented and were intended to be an implementation detail; we expect very few users, if any at all, to be using it. Weβve deleted it in PyTorch 1.6.0. Please usetorch.min/torch.max
instead if you are usingtorch._min/torch._max
.π Deprecations
Deprecated old `torch.save` serialization format (#39460, #39893, #40288, #40793)

We have switched `torch.save` to use a zip file-based format by default rather than the old Pickle-based format. `torch.load` has retained the ability to load the old format, but use of the new format is recommended. The new format is:

- more friendly for inspection and building tooling for manipulating the save files
- fixes a long-standing issue wherein serialization (`__getstate__`, `__setstate__`) functions on `Modules` that depended on serialized `Tensor` values were getting the wrong data
- the same as the TorchScript serialization format, making serialization more consistent across PyTorch

Usage is as follows:

```python
m = MyMod()
torch.save(m.state_dict(), 'mymod.pt')  # Saves a zipfile to mymod.pt
```

To use the old format, pass the flag `_use_new_zipfile_serialization=False`:

```python
m = MyMod()
torch.save(m.state_dict(), 'mymod.pt', _use_new_zipfile_serialization=False)  # Saves a pickle
```

Fixed missing deprecation warning for Tensor.nonzero() (#40187)

Calling `torch.nonzero(tensor, as_tuple=False)` with one argument or `Tensor.nonzero(as_tuple=False)` with no arguments is deprecated and will be removed in a future version of PyTorch. Please specify the `as_tuple` argument.

New Features
Python API

New Utilities

- Added global hooks to `torch.nn.Module` (#38972)
- Added option to enable cpp stack traces with `TORCH_SHOW_CPP_STACKTRACES=1` (#38127)
- Added `torch.utils.show_pickle` for showing pickle contents in saved models (#35168)

New Operators

- `torch.logcumsumexp` added (#36308)
- `torch.logaddexp` added (#38384)
- `torch.rad2deg`, `torch.deg2rad` added (#38852)
- `torch.arccosh`, `torch.arcsinh`, `torch.arctanh` added (#38388)
- `torch.flip{lr, ud}` added (#38599)
- `torch.bucketize`, `torch.searchsorted` added (#34577)
- `torch.istft` (Inverse Short Time Fourier Transform) added (#35569)
- `torch.vander`: added support for generating Vandermonde matrices (#36725)
- `torch.block_diag` added (#33449)
- `nn.Hardswish`, `nn.functional.hardswish` added (#34747)
- `torch.nn.init.trunc_normal_` (truncated normal initializer) added (#32397)
- Added Stochastic Weight Averaging; see `torch.optim.AveragedModel` and `torch.optim.SWALR` for more details (#35032)
C++ API

- Added Optimizer `AdamW` to the C++ frontend (#40009)
- Custom C++ autograd functions now support c10::optional as parameters (#37700)
- torch::Tensor now supports bitwise NOT (!), AND (&), OR (|), and XOR (^) operators (#38691)
- Cpp extension now supports `load` and `load_inline` under ROCm (#35897)
[Beta] Complex Tensor support

The PyTorch 1.6 release brings beta-level support for complex tensors. The UX is similar to existing PyTorch tensors and the new complex-specific functionality is compatible with NumPy's complex arrays. In particular, you'll be able to create and manipulate complex tensors, interop with previously existing code that represented complex tensors as tensors of size `(..., 2)`, and more.

While this is an early version of this feature, and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch's ability to run on accelerators and work with autograd to better support the scientific computing and ML communities.

Please find the full documentation here.
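As a quick, hedged illustration of the new UX (shapes and values here are arbitrary):

```python
import torch

# Create a complex tensor directly...
z = torch.randn(3, dtype=torch.cfloat)
print(z.is_complex())   # True
print(z.real, z.imag)   # float views of the components

# ...or reinterpret an existing (..., 2) float tensor as complex, and back.
pairs = torch.randn(3, 2)
as_complex = torch.view_as_complex(pairs)
as_real = torch.view_as_real(as_complex)  # same storage, shape (3, 2)
```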
Python API:

- Added `torch.is_signed()` for complex tensors (#33773)
- Added dtype inference for complex tensors (#33713)
- Added `torch.randn` and `torch.normal_` for complex tensors (#34037, #35056)
- Added complex type inference for `torch.full` (#34709)
- Added type promotion logic for complex numbers (#34093)
- Added `is_complex` tensor attribute for complex numbers (#34093)
- Added torch.fill for complex tensors (#34973)
- Added `torch.rand` for complex dtypes (#34924, #35585)
- Fixed complex conversions, used in `torch.copy_`, on CUDA (#35344)
- Added `torch.from_numpy` for complex dtypes (#35531)
- Added a check to throw an error for in-place modification of non-complex tensors with complex number values (#35883)
- Fixed `torch.exp` CPU implementation for complex tensors (#35715)
- Added `torch.masked_fill` for complex tensors (#36335)
- Updated `torch.abs` to return float tensors for complex tensors (#35871)
- Added `torch.isfinite` and `torch.isinf` for complex tensors (#36648)
- Added `torch.isclose` for complex tensors (#36456)
- Updated `torch.angle` to return float tensors for complex tensors (#36896)
- Enabled `requires_grad` for complex tensors (#36932)
- Fixed reciprocal divide for complex tensors (#37193)
- Added `torch.reciprocal` for complex tensors on CUDA (#36749)
- Added Python API for `Complex Storage` (#35771)
- Added `torch.addmv` for complex tensors (#37924, #40238)
- Updated dtype inference for `torch.tensor` (#38030)
- Added `torch.pow` for complex tensors on CUDA (#36793)
- Added support for complex values as exponents in `torch.pow` (#36793, #39117)
- Added `torch.roll` for complex tensors on CUDA (#38664)
- Added `torch.gather` for complex tensors on CPU (#36430)
- Added `torch.tanh` for complex tensors on CUDA (#38786)
- Added complex dtypes to the list of supported types in autograd (#38325)
- Added `torch.cumsum`, `torch.cumprod` for complex tensors on CUDA (#39063)
- Added `real` and `imag` views as tensor attributes (#39033)
- Added `torch.flip` and `torch.rot90` for complex tensors (#37826)
- Added `torch.view_as_real`, `torch.view_as_complex` for complex tensors (#39099)
- Added printing logic for complex tensors (#40513, #38031)
- Added `torch.tan` for complex tensors on CUDA (#38400)
- Added support for complex tensors in the `torch.tanh` backward function (#37791, #38786)
C++ API:

- Added core of c10::complex (#36626)
- Added overloads of std:: math functions in c10::complex (#37468, #37689)
- Added c10::complex as the C++ type for complex tensors (#37421, #39306)
- Added support for operations on c10::complex and integer scalars (#38418)
- Added overloads for complex math functions in both :: and std:: to fix ROCm bugs (#39829)
- Added `at::tensor()` and `torch::tensor()` for complex numbers (#39793)
Distributed

- `torch.distributed`: Add `all_to_all` API to the MPI backend in the distributed module (#32361)
- `torch.distributed`: Add `c10d` dynamic loading mechanism to support 3rd-party `c10d` implementations (#28068)
- `torch.nn.parallel.DistributedDataParallel`: Add distributed data parallel benchmark tool (#35198)
- `torch.nn.parallel.DistributedDataParallel` and `torch.distributed.rpc`: allow DDP to work with RPC (#37998, #39916, #40130, #40139, #40495)
Mobile

- Add `torch.utils.mobile_optimizer.optimize_for_mobile` to encapsulate several model optimizations appropriate for mobile models. (Note: currently broken on Windows.) (#35227) (#36357)
New operator registration API

PyTorch 1.6 has a new, pybind11-based operator registration API which replaces the torch::RegisterOperators() class.

Before:

```cpp
static auto registry = torch::RegisterOperators("my_ops::warp_perspective", &warp_perspective);
```

After:

```cpp
TORCH_LIBRARY(my_ops, m) {
  m.def("warp_perspective", warp_perspective);
}
```

You can read more about this API in the custom C++ operators tutorial or the reference documentation.

The new API was developed in PRs #35061, #35629, #35706, #36222, #36223, #36258, #36742, #37019. Internal code was ported to this API in #36799, #36800, #36389, #37834, #38014; you may find the code examples in these PRs helpful for your ports.
ONNX

In PyTorch 1.6, we have added support for ONNX Opset 12. We have also enhanced export of torchvision models, such as FasterRCNN, MaskRCNN, and KeypointRCNN, to support dynamic input image size. Export support for several new ops has also been added. A new operator export mode, ONNX_FALLTHROUGH, has been added to the export API that allows exporting the model with non-standard ONNX operators. For large (> 2 GB) model export (using the `external_data_format=True` argument), we now support models with large tensor data in attributes (not just model parameters).

New ONNX operator support:

- Update Dropout export (#37641)
- Update Argmin/Argmax ONNX export (#38329)
- Fix pow op export (#38065)
- Export support for Celu (#38243)
- Add GreaterOrEqual and LessOrEqual to opset 12 ONNX export (#38311)
- ONNX export support for CrossEntropyLoss (#34830)
- Adding 'numel' and 'to' export for script module (#36501)
- Support clamp_min and clamp_max (#37872)
- Quantization: Add `aten::max_pool2d` to onnx jit pass (#34912)
- Quantization: Mark `upsample_nearest2d`, sigmoid and reshape as no-scale in onnx (#36325)
- Quantization: export of quantized models with new conv and linear API in onnx (#38736)
Quantization

New quantization operators:

- quantized Conv1d (#35093, #36352, #38248, #38283, #38438, #38449, #38749)
- quantized hardsigmoid (#34959, #36351, #36698, #36699)
- quantized hardswish (#34820, #36350, #36252, #36320, #36545)
- quantized layernorm (#36593, #36690, #35693)
- quantized groupnorm (#36835, #39090)
- quantized instancenorm (#36847, #39091)
- quantized reflection_pad1d (#37452)
- quantized adaptive avgpool (#36813)
- channel shuffle op fp32 + quantized (#36815)
- qnnpack path for hardtanh (#35779)
- Quantized Threshold (#39352)
RPC

- `torch.distributed.rpc`: Add TensorPipe RPC backend (#36197, #35483, #37839, #37918, #37919, #37850, #37851, #37852, #37980, #38052, #38265, #38266, #40162, #40389, #37910, #38448, #38818, #38819, #38926, #38931, #38930, #38933, #38934, #39010, #39011, #39397)
- `torch.distributed.rpc`: Support per-RPC timeouts for `rpc_sync` and `rpc_async` (#34650)
- `torch.distributed.rpc.functions.async_execution`: Add an `@async_execution` decorator to allow pausing and resuming executions in RPC target functions (#39216, #39267, #39485, #39486, #39758)
- `torch.futures.Future`: Expose a `Future` type to the Python API (#39008, #37311, #39119, #39597, #39964, #39950)
- `torch.distributed.rpc`: Allow the profiler to be enabled remotely with RPC (#38748, #40066)
- `torch.distributed.rpc`: Implement TorchScript-compatible `RemoteModule` API (#37139, #40173)
- `torch.distributed.rpc.RRef`: enable retrying RRef control messages on communication failures (#33636)
- `torch.distributed.rpc`: Let RPC use `torch._C.Future` instead of exposing a dedicated future type. No impact on the user side (#35039)
- `torch.distributed.autograd`: Add profiler support for `backward` of the distributed autograd engine (#35261)
- `torch.distributed.rpc.RRef`: Add TorchScript support for `RRef.local_value()` (#35433)
- `torch.distributed.rpc.WorkerInfo`: Add TorchScript support for `WorkerInfo` (#35447)
- `torch.distributed.rpc`: Allow profiling RPC with TorchScript target functions (#36275)
- `torch.distributed.rpc.RRef`: Add RRef Python helper to launch a function on the remotely referenced object (#36619)
- `torch.distributed.rpc`: Add timeout argument to TorchScriptable `rpc_async` (#37884)
- `torch.distributed.rpc`: Enable RPC Server Global Profiler (#38847)
- `torch.distributed.rpc`: Implement timeout support for `rpc.remote` and `RRef.to_here()` (#38590)
- `torch.distributed.rpc`: Enable RRef timeout for TensorPipe (#39531)
- `torch.distributed.rpc.WorkerInfo`: Add `WorkerInfo` Python `__repr__` magic method (#40004)
TorchScript

- Fork/Join Async Parallelism (#40438)
- ScriptModule Freezing (#40409, #37044, #38830, #34786, #34787)
Improvements

Python API

- Added long description to wheel packages (#39676)
- `torch.add`: Prevent unbounded growth while adding sparse tensors (#36030)
- `torch.mv`: enabled for sparse tensors (#21782)
- `torch.bmm`: enabled for sparse x dense tensor operations (#33430)
- `torch.cat`: improved error message (#38978)
- `torch.masked_select`: enabled bfloat16 support (#36859)
- `torch.absolute`: added as an alias for `torch.abs` (#36597)
- `torch.device`: improved error message to include `xla` as an acceptable device (#36446)
- `torch.linspace`, `torch.logspace`: improved precision (#35461)
- `Tensor.true_divide` method variant added (#34794)
- `Tensor.isnan()`, `Tensor.isinf()`, `Tensor.isfinite()` method variants added (#37942)
- `Tensor.is_nonzero`: improved error message (#38150)
- `Tensor.cauchy_`, `Tensor.log_normal_`, `Tensor.exponential_`: added support for bfloat16 (#38427)
- `Tensor.as_subclass` method added (#34369)
- `collect_env.py`: improved to detect relevant conda-installed numpy and cudatoolkit (#35646)
- `collect_env.py`: made it more robust on Windows (#39136)
- `torch.utils.data`: Add `generator=` kwarg for DataLoader & random samplers (#39737)
- `torch.utils.data.DataLoader`: properly diagnose exceeding the file descriptor limit (#34768)
- `torch.utils.data.DataLoader`: added repr for WorkerInfo (#39975)
- `torch.utils.data.random_split`: added option to pass a generator for determinism (#34043)
- `torch.utils.data.IterableDataset`: make the warning for when a DataLoader holds an IterableDataset clearer (#41185)
- `torch.nn`: Added support for non-persistent buffers that do not show up in a Module's state dict (#37191)
- `nn.Fold`, `nn.Unfold`: added double backwards support (#36379)
- `nn.MultiheadAttention`: added support for bool/byte `attn_mask` tensor (#33763)
- `nn.functional.upsample`: enabled uint8 sampling support (#35029)
- `nn.functional.kl_div`: added option to accept target in log space (#34586)
- `nn.functional.softmax`: added support for sparse tensors (CPU) (#36305)
- `nn.Softmin`, `nn.Softmax`: improved repr (#39084)
- warnings: Changed warnings generated in cpp to show the point of Python origination (#36052)
- warnings: Improve warnings to actually point at user code (#39143)
- Extend some of the basic ops to kHalf (#37121)
- Added a warning for a known autograd issue on the XLA backend (#35449, #35543)
- `torch.cuda`: Change DeprecationWarning to FutureWarning (#32142)
- Added `torch.utils.cmake_prefix_path` pointing to the `share/cmake` folder (#38559)
- `torch.hub`: Added `file_name` argument to `load_state_dict_from_url` (#39749)
- Disable autograd while preparing a Tensor for printing (#39420)
- Improved CUDA error message for MSVC (#39987)
- Improved reentrant autograd error message (#38625)
- Let >> and << support half on CUDA (#37670)
- dockerfile: Update miniconda installer download location & remove unnecessary flag (#37082)
- `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` added. These return the architecture list and gencode flags PyTorch was compiled with (#41212)
- `torch.min`, `torch.max`: significantly improved CUDA performance (#38440, #39029)
- `torch.multinomial` with `replacement=False`: significantly improved performance (#39742)
Python Type Annotations

- `torch.autograd`: add type hints in-line (#38080)
- `torch.finfo`, `torch.iinfo` type annotations added (#38220)
- Moved `torch.cuda` annotations inline (#40075)
- Add typing for `torch.cuda._CudaStreamBase` and `torch.cuda._CudaEventBase` classes (#40256)
- Introduced `torch.types.Device` and stubbed all `torch._C` functions comprehensively (#38173)
- Move all `torch.nn` module type annotations inline (#38211)
- Fixed type annotations for named tensors (#36890)
- Fix minor issue in type stub for Optimizer (#38067)
- Fixed some miscellaneous type hints (#36584)
- Fix multiple issues with type annotations (#36358)
- `torch.autograd.anomaly_mode`: fixed type hints stub (#39324)
- `torch.backends.cudnn`: added type annotations (#38947)
- `torch.channels_last`, `torch.preserve_format`: added annotations (#39120)
AMD/ROCm

- `torch.topk`: enabled support for BFloat16 type on ROCm (#34849)
- `torch.dot`: enabled fp16 support on ROCm (#30431, #30432)
- `torch.add`: enabled support for BFloat16 type on ROCm for sparse tensors (#35978)
- Enabled bfloat16 for operators in the BERT model (#37634)
- `torch.log`: improved ROCm support (#40079)
- `torch.pow`, `torch.exp`, `torch.erf`: enabled support for BFloat16 type on ROCm (#40236)
C++ API

- Eliminate warnings for cpp extensions on Windows (#37400)
- Disable C4251 when compiling `cpp_extensions` on Windows (#35272)

Note: the above two PRs eliminate unnecessary compile warnings for Windows builds, making the build log more readable.
Distributed

- `torch.distributed`: Enhance the error message for MPI unavailability (#36781)
- `torch.distributed`: Expose the `torch.distributed.is_available()` API (#37021)
- `torch.utils.data`: Only create `torch.generator` and seed in `DistributedSampler` when shuffling (#37604)
- `ProcessGroup`: Log incorrect device in `ProcessGroupGloo` (#38844)
- `torch.utils.data`: Improve `DistributedSampler` docs and add seed option (#39628)
- `torch.cuda.comm.reduce`: Avoid initializing unnecessary tensors in `nccl.reduce` (#39688)
- `torch.nn.parallel.DistributedDataparallel`: Remove obsolete warning message from DDP (#40190)
Distributions

- `distributions.Cauchy`: Implemented KL divergence (#36477)
- `distributions.Transform`: Add a `.with_cache()` method (#36882)
- `distributions.Binomial`: Implemented the BTRS algorithm for fast/efficient binomial sampling (#36858)
Internals

- New macro `TORCH_FN` for passing in compile-time function pointers as regular function arguments rather than template arguments (#39823, #40110)
- Improved support for more types in registered custom kernels
- Added FPGA DispatchKey, DeviceType, Backend for out-of-tree experimentation (#38938)
- Better type safety for calling the dispatcher; we now do a runtime test when casting OperatorHandle to TypedOperatorHandle that you've provided the correct type for kernels (#40251)
- OperatorHandle::callBoxed now works on all operators; you no longer need to manually go through the JIT registry (#36010, #36850)
- Added Dispatcher::redispatch for performing a dispatch that bypasses the current key and all keys before it (#35476, subsequently renamed)
- More operators are fully supported by the dispatcher (#37273, #36564, #36398, #36666, #36838)
- Tracing is no longer done inside our autograd code; instead it has been factored into a separate Tracing dispatch key (#39514, #38467)
- DispatchKey computation no longer relies on TensorOptions; instead, factory functions and other functions with special dispatch key computation needs can register a BackendSelect kernel to compute the required key (#36290, #36562, #37257)
ONNX

- Enable constant folding for ONNX Opset 12 (#34823)
- ONNX: Update training ops and training-amenable export API (#35567)
- Fix for constant folding: Slice; added ReduceL1 and ReduceL2 (#35280)
- Added support for constant folding onnx::Add and onnx::Sub (#35869)
- Enable constant folding for Shape (#35386)
- Improve error checking for large model export (#37798)
- Remove Aten ops from ONNX export (#37239)
- Update pytorch/onnx docs (#39480)
- Update pytorch/onnx docs for new export API args (#39802)
- Support large attribute and subgraph for large model (#38793)
Operator Benchmark

- Added benchmark for quantized batchnorm (#35389)
- Added more quantized activation benchmarks and input sizes (#35729)
- Added `__torch_function__` benchmarks (#36138)
- Aligned the qconv benchmark to conv (#36673)
- Aligned the qlinear benchmark to linear (#36674)
- Added CUDA support for the observer benchmark (#39360)
Profiler

- `torch.autograd.profiler`: Make RecordFunction callbacks thread-local and modernize the interface (#37491)
- `torch.autograd.profiler`: Make the profiler thread-local (#36291)
Quantization

- Add ConvBn3d, ConvBnReLU3d, BNReLU2d, BNReLU3d to eager mode quantization (#33540)
- Enabled per-channel quantized static linear/conv in QNNPACK (#37622)
- Enable per-channel quantization for LSTM modules (#39666, #39041)
- Dynamic quantization support for LSTMCell, RNNCell and GRUCell (#40102)
- Quantization-aware training now works with nn.DataParallel and nn.DistributedDataParallel
- Add quantized tensor support on CUDA (#37081)
- Add reduce_range params for quantized_lstm (#39604)
- Use TorchBind for ConvPackedParams (#35923)
- Use TorchBind for Linear PackedParams (#38101)
RPC
- `torch.distributed.rpc.RRef`: Throw an actionable error message when `RRef.to_here()` is called in TorchScript (#35369)
- `torch.distributed.rpc.RRef`: Handle exceptions returned via `remote()` calls (#35331)
- `torch.distributed.rpc.RRef`: Make the RRef type-hint mismatch exception message more actionable to users (#35943)
- `torch.distributed.rpc`: Allow aborting `RecvWork::wait()` in `ProcessGroupAgent::listenLoop` (#36084)
- `torch.distributed.autograd`: Appropriately handle exceptions in the autograd engine (#36019)
- `torch.distributed.autograd`: Catch exceptions in distributed engine callbacks (#36118)
- `torch.distributed.autograd`: Avoid some future callback self-captures (#36502)
- `torch.distributed.rpc`: Propagate errors from RPC retries to the original attempt (#35263)
- `torch.distributed.autograd`: Ensure the future is complete when exiting `Engine::mark_graph_task_completed()` (#36856)
- `torch.distributed.autograd`: Trigger pre/post hooks of output function nodes under distributed autograd (#34501)
- `torch.distributed.rpc`: Support creating an RPC gang of world size 1 (#32731) (see the sketch after this list)
- `torch.distributed.autograd`: Improve the error message for dist autograd context cleanup failures (#37255)
- `torch.distributed.rpc`: Guard against a negative `rpcTimeout` being passed in to `RpcBackendOptions` (#38267)
- `torch.distributed.rpc`: Use an infinite timeout for operations in the ProcessGroup RPC backend (#38577)
- `torch.distributed.rpc.WorkerInfo`: Add a `WorkerInfo` stringifier (#39974)
- `torch.distributed.rpc`: Avoid using the default process group in ProcessGroupAgent (#39909)
- `torch.distributed.rpc`: Ignore expected errors in the TensorPipe RPC backend (#39182)
- `torch.distributed.rpc`: Don't use a separate heap allocation for metrics in the TensorPipe RPC backend (#39183)
- `torch.distributed.rpc`: Bind to the hostname's IP address instead of localhost in the TensorPipe RPC backend (#39184)
- `torch.distributed.rpc`: Use PrefixStore to avoid conflicting keys in the TensorPipe RPC backend (#39185)
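A minimal sketch of a world-size-1 RPC gang (#32731); the worker name, address, and port are illustrative:

```python
import os
import torch
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "localhost"  # illustrative rendezvous settings
os.environ["MASTER_PORT"] = "29500"

rpc.init_rpc("worker0", rank=0, world_size=1)  # single-process gang
fut = rpc.rpc_async("worker0", torch.add, args=(torch.ones(2), torch.ones(2)))
print(fut.wait())  # tensor([2., 2.])
rpc.shutdown()
```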
TorchScript
Improvements
- Add `id` function (#34975)
- Add lazy script decorator (#34935)
- Make the Future type annotation available in Python (#27637)
- Support converting `str` to `float` (#35352) (see the sketch after this list)
- Enable recording of TorchScript functions (#34710)
- Improve the error message when registering a custom class twice (#35568)
- Improve optimization of `if` statements with statically determinable predicates (#35834)
- Fix reporting of error message in `toBool` (#35570)
- Better error when the types of a default value and parameter do not match (#35888)
- Improve serialization for lists and dictionaries (#35741)
- Add type hints on `hardsigmoid`, `hardswish`, and `elu` to make them scriptable (#35885)
- Add `strict` tracer flag to guard against risky behaviors (#36277)
- Add support for `Dict` as output when connecting scripting and tracing (#36265)
- Use the current default `dtype` with `torch.tensor` when `dtype` is not specified (#36587)
- Add dictionary as output of tracer (#36696)
- Allow casting `str` to `int` (#36016)
- Convert float Tensor argument to double in `Tensor.tolist` (#37465)
- Add a `code_with_constants` method to module printing (#37586)
- Support indexing using a list literal as index (#37848)
- Support indexing using a list variable as index (#37966)
- Support `del` statements with variables as targets in TorchScript (#37608)
- Recursively compile TorchScript class types (#38050)
- Better error message when `__init__` is missing on custom C++ classes (#37474)
- Fix `@staticmethod` access from `self` on modules (#37702)
- Allow `@torch.jit.unused` to be used on TorchScript classes (#38522, #39336)
- Add support for the `%=` operator in TorchScript (#38983)
- Provide error messages when the JIT infers the type of an argument as `Tensor` (#38527)
- Allow self-referential type annotations in TorchScript classes (#39821)
- Support having a different forward method when not in scripting mode (#38158)
- Fix `index_put_` error in subscript assignment (#38378)
- Refactor attributes to support buffers and parameters as first-class citizens; add support for iterating over `named_buffers()` (#37905)
- Add ROCm-specific `half_support_literal` (#38899)
- Make `torch.unique_consecutive` compilable (#39339)
- Make `deepcopy()` of Objects call `__getstate__`/`__setstate__` if present (#39500)
- Allow slicing sequential containers (fe45c2c)
- Support `torch.Tensor` subclasses (like `Parameter`) as inputs to functions (#39487)
- Add `dtype` as a supported type annotation (#39741)
- Improve the error message when a Future type annotation is missing a contained type (#39751)
- Fix inconsistent results of string `split` func (#38772)
- Support `pad_sequence`/`pack_sequence` (#39844)
- Enable `copy.deepcopy` and `copy.copy` for `RecursiveScriptModule` (#32685)
- Fix zip serialization for files > 2GiB (0c90b6d)
- Fix `dictConstruct` ordering and enable dict mix (41816dc)
- Fix delegating to `jit.load` from `torch.load` (#41013)
- Add distributed `backward` support (#38494)
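A minimal sketch of the new string conversions in TorchScript (#35352, #36016); the function itself is illustrative:

```python
import torch

@torch.jit.script
def parse_and_scale(s: str, factor: int) -> float:
    # float(str) and int(str) are now scriptable
    return float(s) * factor + int("2")

print(parse_and_scale("1.5", 2))  # 5.0
```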
Bug Fixes
Python API
- `torch.cat`: fixed missing type promotion (#35030, #39777) (see the sketch after this list)
- `torch.gather`: fixed silently incorrect results when in-place gather tries to use incorrect shapes (#37102)
- `torch.median`: fixed `NaN` comparison (#38216)
- `torch.cdist`: fixed backward calculation for `p=2` (#37337)
- `torch.eig`: fixed segfault when input has NaNs and infs (#37642)
- `torch.irfft`: stopped modifying the input in-place (#35219)
- `torch.max`, `torch.min`, `torch.median`: fixed incorrect backwards implementation (#36316)
- `torch.fmod`: fixed crash on division by zero (#38919)
- `torch.multinomial`: fixed support for tensors with empty batch (#39873)
- `torch.einsum`: fixed incorrect `__torch_function__` handling (#38741)
- `torch.remainder`: fixed overflow when the dividend is very large (#37758)
- `torch.remainder`: fixed precision issues for CPU tensors (#38293)
- `torch.argmax`, `torch.argmin`: fixed bug for big CPU tensors with `dim=2` (#39576)
- `torch.histc`: fixed support when passed an empty tensor (#38987)
- `torch.as_strided`: added an error message when passed a negative stride (#39508)
- `torch.argmax`, `torch.argmin`: fixed bogus returns when called on a scalar tensor (#37214)
- `torch.topk`: fixed bogus results with 4d+ input tensors with topk dimension >= 1024/2048 on CUDA (depending on the GPU) (#40349)
- `torch.mv`: fixed bug when grad has `stride=0` on GPU in the backward pass (#38321)
- `>>`, `<<` on CUDA changed to match the behavior on CPU for certain compiler variants (#35339)
- `Tensor.exponential_(0)` fixed to return a Tensor filled with `inf` (#36837)
- `Tensor.to(..., non_blocking=True)`: fixed regression where `non_blocking` was ignored (#35144)
- `Tensor.to`: fixed CUDA negative float to uint8 cast to be consistent with CPU (#36832)
- Fixed incorrect binary pointwise operations when the first argument is a scalar (#39956)
- `Tensor.copy_`: fixed error when used with AMD devices (#38003)
- `torch.tensor`: fixed segfault in error checking in the Tensor constructor (#40106)
- Fixed overflow issues when constructing tensors with large numbers (#39140)
- Fixed regression in unary ops casting to output dtype (#41097)
- `nn.Module`: fixed AttributeError reporting for `nn.Module`'s properties (#34324)
- `nn.MaxPool2d`: fixed returning the wrong shape with `return_indices=True` on CUDA (#38992)
- `nn.MaxPool2d`: fixed NCHW backward bug (#38953)
- `nn.MaxPool2d`: fixed dilated case (#36288)
- `nn.MultiheadAttention`: removed weights from `__constants__` to fix warnings when converting to TorchScript
- `nn.ConvTranspose2d`: fixed error in backward pass for fp16 inputs (#37569)
- `nn.ConvTranspose3d`: fixed index overflow (#39198)
- `nn.RReLU`: fixed memory leak (#39347)
- `nn.PReLU`: fixed stack overflow in backward pass (#36134)
- `nn.MultiheadAttention`: fixed assertion to support FP16 training (#37539)
- `nn.MultiheadAttention`: updated assert to remove check on the 3rd dim for MHA (#39402)
- `nn.ModuleDict`, `nn.ParameterDict`: fixed bug when updating with another `ModuleDict`/`ParameterDict`, respectively (#27814)
- `nn.BatchNorm`: fixed buffer update when `track_running_stats` is set to `False` (#38084)
- `nn.MaxPool3d`: fixed incorrect CUDA backward results for non-square output (#36820)
- `nn.DataParallel`: fixed support for empty tensors (#35965)
- `nn.functional.grid_sample`: fixed out-of-boundary bug when the grid contains large numbers (#35506)
- `nn.functional.max_pool2d`, `nn.functional.avg_pool2d`: fixed issue when `stride=None` (#39221)
- `nn.functional.max_pool2d`: fixed erroneous dimension out of range on CUDA (#36095)
- `nn.grad._grad_input_padding`: fixed support for the dilation argument (#33872)
- `nn.functional.log_softmax`: improved accuracy on CUDA (#38945)
- `nn.utils.prune`, `nn.utils.weight_norm`: fixed problems when used with RNNs (#34170)
- Fixed nan, inf in GPU {fractional,adaptive} max_pool{2,3}d (#39903)
- `nn.functional.interpolate`: nearest interpolation implementation fix for CUDA (#39055)
- `torch.utils.mkldnn.to_mkldnn`: cover `nn.Conv1d` in mkldnn model conversion logic (#38528)
- `torch.utils.data.DataLoader`: relaxed sampler check in BatchSampler (#38403)
- `torch.utils.data.DataLoader`: raise TypeError (rather than a generic exception) when `RandomSampler.replacement` is non-boolean (#36547)
- `torch.utils.data.DataLoader`: corrected a ValueError in the dataloader to TypeError (#36244)
- `torch.utils.data.DataLoader`: allow shuffle when auto-batching is disabled (#39865)
- `torch.utils.data.DataLoader`: kill DataLoader workers when we can't join, to clean up gracefully (#39869)
- `torch.utils.data.DataLoader`: added error when using `default_collate` on lists of unequal size (#38492)
- Fixed crashes on `import torch` related to defining static data in Vec256 (#37767, #38088)
- For `out=` operations, preserve the output tensor's strides if it is correctly sized (#38895)
- `cuda`: fixed a bug where it was possible to incorrectly access the CUDA device before it was initialized (#36714)
- `torch.device`: added better device index parse checks (#37376)
- `torch.autograd`: fixed init-shutdown race condition in the autograd engine (#39194)
- `torch.autograd`: fixed error when using hooks with no `__name__` attribute
- `torch.autograd`: fixed error message (#39729)
- `torch.autograd`: wait for non-reentrant threads to shut down (#34529)
- `torch.autograd`: added undefined tensor gradient support to all backward functions (#39400)
- `torch.autograd`: fixed engine flakiness (#35599)
- `torch.autograd.Function`: fixed ability to report error messages inside (#34845)
- `torch.autograd`: move scalar input to a different device when needed; fixes backward passes of binary-pointwise operators with scalar inputs (#35286)
- `torch.autograd.gradcheck`: fixed behavior for `stride=0` (#38774)
- `torch.autograd.Function`: prevent custom Functions from creating a non-differentiable type that requires grad (#38326)
- `torch.no_grad`: fixed bad interaction between `torch.no_grad` and `tensor.numpy()` conversion (#38906)
- `torch.optim.AdamW`: fixed error message (#36088)
- `torch.optim.Optimizer.state_dict()`: fixed non-determinism (#37347)
- `torch.hub`: added optional request headers to avoid "connection refused" errors (#39740)
- `torch.hub.hub_dir`: fixed inconsistencies (#38969)
- OpenMP: fixed memory leak for `num_threads==1` with operations that use OpenMP (#39533)
- `torch.multiprocessing`: fixed deadlock when sharing CUDA tensors (#40347)
- `torch.distributions.Binomial`: fixed bug where there was a small chance of incorrectly returning -1 (#38456)
- `torch.cuda.amp.GradScaler`: fixed bug where `GradScaler` was not pickle-able (#38296)
- Fixed uninitialized value in helper function `vec_reduce_all` (#37853)
- Fixed potential memory corruption in helper function `cpu_serial_kernel` (#37869)
- Synchronize MAGMA functions with the current CUDA stream (#36605)
- Windows support: fix OpenMP detection with the clang-cl compiler (#35365)
- Windows support: use the `ProgramFiles` environment variable on Windows for portability (#39707)
- Windows support: fix AVX detection with clang-cl (#35653)
- Windows support: delay loading the CUDA library until it is necessary (#37811)
- Windows support: fix `_copysign` is not a member of std (#35199)
- Windows support: fix zip serialization for files > 2GiB (#40783)
- Windows support: add runtime check for the MSVC redist, fixing `import torch` errors (#39841)
- Windows support: more fixes for using Windows APIs through ctypes (#39376)
- Windows support: fixed `import torch` errors (#39334)
- Windows support: fix wrong MSVC version constraint for CUDA 9.2 (#40794)
- Windows support: use LoadLibraryEx, fixing problems when loading DLLs (#38302)
- Windows support: fix DLL load failure in virtual environments (#39622)
- Windows support: make `find_first_set` work on x86 MSVC (#38637, #38706)
- Removed pickle deprecation warning (#39003)
- dockerfile: sync submodules (#35423)
- Fixed crashes in `manywheels` builds related to having special `CUDNN` search path rules for `torch_python` (#37349)
- `torch._six.PY37` should be true for Python 3.8 as well (#40868)
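As a usage sketch of the `torch.cat` type-promotion fix above; the dtypes are illustrative:

```python
import torch

a = torch.ones(2, dtype=torch.int32)
b = torch.ones(2, dtype=torch.float64)
out = torch.cat([a, b])
print(out.dtype)  # torch.float64: mixed inputs are promoted instead of failing
```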
AMD/ROCm
- Stopped erroneously warning about CUDA compute capabilities (#35949)
- Stopped using MIOpen for tensors with more than `INT_MAX` number of elements (#37110)
- Enable HgemmBatched for ROCm (#37483)
- Fix encoding errors for the hipify tool (#37906)
- Added HIP version guard for occupancy API compatibility (#38551)
- Fix the processing logic of `bernoulli` (#40001)
- Use correct device type when exporting tensors to DLPack (#40124)
C++ API
- Fixed a crash when using `BuildExtension.with_options` (#40121) (see the sketch after this list)
- Fixed a directory permission-denied problem when multiple users build cpp extensions on the same machine (#34239)
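A minimal `setup.py` sketch using `BuildExtension.with_options`; the extension name and source file are placeholders:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",  # placeholder name
    ext_modules=[CppExtension("my_ext", ["my_ext.cpp"])],  # placeholder source
    # with_options returns a build_ext subclass with the options pre-applied.
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=False)},
)
```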
Distributed
- `torch.nn.SyncBatchNorm`: Fix batch size check (#37133).
- `torch.nn.parallel.DistributedDataParallel`: Fix DDP error checking for unused parameters (#36054).
- `torch.nn.DataParallel`: Ensure `DataParallel` replicas can be pickled (#37307).
- `torch.distributed`: Ensure `NCCL_BLOCKING_WAIT=1` works for `dist.barrier()` (#40249).
- `torch.nn.SyncBatchNorm`: Avoid blocking the host thread when using `SyncBatchNorm` (#36659).
- `torch.cuda.comm.gather`: Fix `Gather::apply` to avoid accessing moved tensors (#39733).
- `torch.nn.parallel.DistributedDataParallel`: Add a guard to allow DDP's autograd engine callback to work with non-default CUDA streams (#40115).
Internals
- Add missing mutex for listener removal (#35486)
- Add missing mutex for fallback register/deregister (#36628)
- Improved boxed dispatch performance (#33313)
- Refactored `jit::Operator` to more clearly distinguish the two possible states: c10 vs jit (#33905, #36634)
- Per-device initialization now occurs in backend kernels via code generation, rather than during backend selection (#37402)
- Improved support for the dispatcher on mobile
- Improved error messages
ONNX
- Fixed the default dtype value for ONNX hardtanh export (opset 11) (#35467)
- Disable size optimizations for ONNX (#36243)
- Added a pass to replace the interpolate function with `aten::__interpolate` (#35744)
- Fix `provider_version` and add consistency test (#36797)
- Fix numerical errors in softmax when dim is not the last dimension (#37326)
- Make ONNX expect tests resilient to producer_version changes (#39002)
- Enable model tests (#38791)
- Enable constant folding tests (#38751)
- Bump up ONNX submodule to a82c6a7010e2e332d8f74ad5b0c726fd47c85376 (#39372)
- Fix type casting for reduce ops (#38829)
- Fix ONNX export of RNNs with no bias (#36894)
- Fix regression disabling checker (#39073)
- Fix KeypointRCNN test (#39589)
- Fix bug in export of ops involving the torch.bool type (#40006)
- Fix bug in export of the cumsum operator (#40044)
- Set ONNX opset version before model select (#37466)
- Enable tests for opset 12 (#37846)
- Enable tests in `test_pytorch_onnx_onnxruntime` (#37868)
- Enable tests in test_operators.py (#39431)
Operator Benchmark
- Fixed missing comma in activation benchmarks (#35104)
- Fixed bug where activation benchmarks didn't run anything (#35731)
- Replaced `import cpp_benchmark` with `torch.utils.cpp_benchmark` (#38832)
Profiler
- `torch.autograd.profiler`: Use `high_resolution_clock` for profiling on Mac (#37280)
- `torch.autograd.profiler`: Fixes for profiling JIT code (#38453)
- `torch.autograd.profiler`: Destroy CUDA events after profiling (#39962)
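A minimal profiling sketch touching the code paths fixed above; the workload is illustrative:

```python
import torch
from torch.autograd import profiler

x = torch.randn(256, 256)
with profiler.profile() as prof:  # add use_cuda=True to also record CUDA events
    torch.mm(x, x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```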
Quantization
- Fix a bug for convolution bias in QAT Conv-BN (#36173)
- Ensure that histogram observers have a zero-point of zero for post-ReLU activations (#37107)
- Unify numerics between fake quantization and quant/dequant (#37188)
- Release QNNPACK original weights for conv/linear (#37595)
- Fix histogram observer with 0 input (#40191)
- Histogram observer bug fix when min == max (#40310)
- Add save/load of state_dict for quantized dynamic RNNs (#39105)
- Ensure qconv doesn't assert with an empty batch (#38252)
- Support empty-batch input for quantized ops (#38508) (see the sketch after this list)
- Fixed CUDA memory pinning (#41139)
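A minimal sketch of the empty-batch support (#38508); the shapes and quantization parameters are illustrative:

```python
import torch

x = torch.randn(0, 2, 4, 4)  # batch dimension of zero
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
print(qx.shape, qx.dequantize().shape)  # both torch.Size([0, 2, 4, 4])
```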
RPC
- `torch.distributed.autograd`: Respect the dist autograd context in `torch.jit._fork` (#34360)
- `torch.distributed.autograd`: Continue trying `send()` even if one `send()` failed when cleaning up distributed autograd contexts (#34943)
- `torch.distributed.rpc`: In the ProcessGroup RPC backend, avoid read-after-free (#35252)
- `torch.distributed.rpc`: Fix `aten::wait` for RPC futures (#35695)
- `torch.distributed.rpc`: Fix `prim::rpc_async` for RPC futures (#35994)
- `torch.distributed.rpc`: Only schedule retries before agent shutdown (#35554)
- `torch.distributed.rpc`: Call `threadPool.waitWorkComplete` after `listenerThread.join()` to fix graceful shutdown (#35394)
- `torch.distributed.rpc`: Fix a potential TSAN issue when joining RPC helper threads (#36094)
- `torch.distributed.rpc`: Fix race during RPC shutdown (#36113)
- `torch.distributed.rpc`: Fix RPC shutdown and thread joining (#36239)
- `torch.distributed.autograd`: Capture global state, the distributed autograd current context id, before thread switching triggered by JIT `future.wait()` (#36395)
- `torch.distributed.autograd`: Fix race in `mark_graph_task_completed` (#36640)
- `torch.distributed.rpc`: Acquire the GIL when constructing/destructing `ConcretePyObjectHolder` (#37870)
- `torch.distributed.rpc`: Explicitly decref `py::object` in `ConcretePyObjectHolder` and `PythonFunctionGuard` (#38364)
- `torch.distributed.rpc`: Explicitly decref `py::object` in `PythonRpcHandler` (#38366)
- `torch.distributed.rpc`: Keep `py::object` alive until `jit::toIValue` returns (#38348)
- `torch.distributed.rpc`: Use the GIL to guard decref of the `jit::toPyObj` return value in `processRpc` (#38376)
- `torch.distributed.rpc`: Use Future's `then()` API to make sure profiling logic is completed when the Future completes (#38352)
- `torch.distributed.rpc`: Fix timeout computation in the TensorPipe agent (#38928)
- `torch.distributed.rpc`: Fix lock inversion upon response read error handling (#38929)
- `torch.distributed.rpc`: Acquire lock when adding messages to the timeout map, to fix a race in the TensorPipe RPC backend (#39398)
- `torch.distributed.rpc`: Explicitly decref in the `UnpickledPythonCall` dtor (#38398)
- `torch.distributed.rpc`: Fix possible deadlock in `_wait_all_workers` (#39535)
- `torch.distributed.rpc`: Release the GIL when deleting users and unforked owners (#39555)
- `torch.distributed.rpc`: Fix error handling for `rpc.remote` (#39605)
- `torch.distributed.rpc`: Fix RRef alias annotation (#39933)
- `torch.distributed.rpc`: Fix TensorPipeAgent shutdown to ensure it drains all outstanding work (#40060)
- `torch.futures`: Let `torch.futures.wait_all()` re-throw errors (#40291)
- `torch.distributed.autograd`: Add basic GPU support to distributed autograd (#40312)
TensorBoard
- `summary.hparams`: Support `None` in `hparams_dict` (#36497)
- `SummaryWriter.add_scalars()`: Removed incorrect documentation (#36495)
- `SummaryWriter.add_embedding`: Fix error where NaN appears in some cases (#36496)
- `SummaryWriter.add_hparams`: Fix input parameters (#31301) (see the sketch after this list)
- `SummaryWriter.add_image_with_boxes`: Added option to add strings to image boxes (#30941)
- `SummaryWriter.add_graph`: Fixed missing documentation (#37504)
- `SummaryWriter.add_hparams`: Let hparam values render correctly (#31544)
- Enforce a TensorBoard minimum version of 1.15 (#35952)
TorchScript
- Fix scope of writes in comprehensions (#36105)
- Fix name collision during module loading (#35720)
- Fix `NamedTuple` resolution (#35409)
- Fix copying of bound methods from a `Module` to a `ScriptModule` (#36546)
- Fix lifting bug in tracing module calls (#37189)
- Fix tracing of return types for modules that return heterogeneous tuples (#37190)
- Add type-hint check for default arguments in the TorchScript C++ frontend (#39021)
- Fix recursive compilation of functions annotated with `@torch.jit._script_if_tracing` (#40468)
- Fix parsing of subscript expressions using the Python resolver (#39269)
- Fix compilation error with gcc 5.5 (#38112)
- Fix handling of `aten::masked_select`; properly update the type of `aten::unsqueeze`'s output in shape analysis (#40716)
- Fix handling of `aten::unfold`; properly handle the default dtype; and fix a gradient-thrashing issue in shape analysis (#41044)
- Fix a bug with incorrect handling of `aten::view` in autodiff graph construction (#42029)
- Fix incorrect handling of constructor operations with tensor inputs in shape analysis: tensor properties are now based on an input tensor rather than defaults (#41016)
- Fix bug with incorrect handling of the `prim::grad` operation for `Undefined` values in shape analysis (#41015)
- Fix incorrect `requires_grad` property propagation onto a loop's block inputs (#41014)
Performance
Misc
- `F.avg_pool2d`: added specialized kernel for channels-last (#35855)
- Relax cuDNN conditions for channels-last convolutions (#38904)
- `torch.cat`: enabled fast path for channels-last inputs (#39448) (see the sketch after this list)
- `torch.index_put`: parallelized the accumulate CPU float path with `cpu_atomic_add_float` (#29705)
- Make discontiguous tensors also benefit from unrolling (#34708)
- `torch.scatter`, `torch.gather`: removed some redundant checks to achieve some speedups (#34690)
- `torch.scatter`, `torch.gather`: improved performance on CUDA (#36181)
- `torch.min(tensor, dim)`, `torch.max(tensor, dim)`: optimized performance on CPU (#34875)
- `torch.index_select`: optimized performance for 1D inputs (#35243)
- Vectorize (CPU) generic types for binary bitwise operators (#34338)
- `torch.linspace`: vectorized on CPU (#27957, #34555, #35842, #37981, #38093)
- Set device only when device indices are different (#35438)
- Don't replace TensorImpl for in-place min/max dim (#35591, #39696)
- `torch.clamp`: vectorized for bfloat16 (#35082)
- bfloat16: vectorized many unary ops (#35092)
- `torch.bincount`: optimized for CPU by removing extra `size()` calls (#35822)
- Improve reduction op performance on CUDA for large tensors (#35997, #36014)
- Vectorize in-place comparison operators (#35117)
- Vectorize reductions when reducing along the fastest-striding dimension (#36873)
- `nn.EmbeddingBag`: add a fast path that calls FBGEMM (#36679)
- `nn.Conv3d`: optimized grouped Conv3d performance (#36355)
- Reduce overheads in several CPU kernels by avoiding restrides (#36875)
- `nn.EmbeddingBag`: leave output and `bag_size` uninitialized in the fast path to save overhead (#36681)
- `nn.SmoothL1Loss`: vectorize the forward pass (CPU) (#37114, #37115)
- `nn.Unfold`: optimized backward pass (#36612, #38871)
- Add a per-device allocator object in CUDACachingAllocator, reducing lock contention between operations on different devices (#37567)
- Lazily initialize the thread-local `num_threads` value (#37461)
- Vectorize non-persistent Softmax (#38557)
- `nn.GroupNorm`: performance optimized on CPU and CUDA (#28203, #28204)
- `torch.cumsum`, `torch.cumprod`: restore the thrust path for 1d cumulative ops (#39180)
- TensorIterator: remove unnecessary `!op.is_read_write` test (#39747)
- `torch.multinomial`: fast path for `replacement=False` (#39742)
- Vectorize on output for reduction kernels (#37206)
- `nn.Upsample`: optimized performance for linear modes on CPU (#34864)
- Make the dynamic casting case also benefit from unrolling (#34749)
- `torch.sinh`, `torch.cosh`: vectorized on CPU (#36396)
- Speed up sparse tensor gradient accumulation (#36292)
- `torch.masked_select`: sped up (#36539, #33269)
- `torch.var`, `torch.std`: sped up (#39967)
- `torch.max(tensor, dim)`, `torch.min(tensor, dim)`: sped up (#39029)
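A minimal sketch of the channels-last `torch.cat` fast path (#39448); the shapes are illustrative and the memory-format propagation shown is the expected behavior:

```python
import torch

x = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
y = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
out = torch.cat([x, y], dim=1)  # channels-last inputs take the fast path
print(out.is_contiguous(memory_format=torch.channels_last))
```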
Distributed
- `torch.nn.SyncBatchNorm`: Speed up `SyncBatchNorm` by batching distributed communication (#38246).
- `torch.nn.parallel.DistributedDataParallel`: Dynamically adjust DDP bucketing order using the signals collected from the first iteration (#35137).
Mobile
- Use XNNPACK to improve performance for some instances of convolution and linear (#35790, #35791)
- Use a custom allocator on mobile to automatically include padding for {Q,X}NNPACK, reducing reallocation costs (#36032)
- Use the updated open-source pthreadpool to improve multi-threading performance (#40951)
Quantization
- qmul and qadd should preserve the input memory format (#34834)
- Remove the slow (NCHW) path for avg_pool3d (#34994)
- Optimized qadd_scalar (#34925)
- Optimize qavg_pool3d_nhwc (#35740)
- Changes to qadd for perf improvement (602b51e)
- Improve quantized batch_norm performance (#35639)
- Add a vector path to the copy kernel for quantized data types (#36189)
- Speed up calculating qparams for per-channel observers (#30485)
- Enable float requantization for avgpool/gavgpool ops (#37037)
- Move to using MemoryFormat::ChannelsLast for avgpool2d (#36812)
- Use `gpu_kernel` in the affine quantizer (#37312)
- Perf optimization for conv and gemm kernels (#37626)
RPC
- `torch.distributed.rpc`: In the RPC server, handle TorchScript continuations asynchronously (#34109)
- `torch.distributed.autograd`: Avoid holding the lock when completing the GraphTask futureResult (#35101)
- `torch.distributed.autograd`: Lock optimizations for `DistAutogradContainer` (#36529)
- `torch.distributed.rpc.RRef`: Prevent `RRef.to_here()` from blocking an RPC thread on the callee by using Future callbacks (#36805)
- `torch.distributed.rpc.RRef`: Prevent `RRef` unpickle from blocking while waiting for `OwnerRRef` creation (#36785)
- `torch.distributed.autograd`: Remove spinning in the dist engine (#36606)
- `torch.distributed.rpc`: Avoid releasing and reacquiring the lock per iteration in the RPC retry thread (#38521)
TorchScript
- Add vectorized load/store support for JIT-generated CUDA kernels (#36555)
- Speed up alias analysis (#36345)
- Make the new zip serialization for torch save/load significantly (~70%) faster (#38379) (see the sketch after this list)
- Run extra optimizations after inlining (#35562)
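The zip container has been the default `torch.save` format since PyTorch 1.6; a minimal round-trip sketch (the payload is illustrative):

```python
import io
import torch

buf = io.BytesIO()
torch.save({"w": torch.randn(4, 4)}, buf)  # zip-based format by default
buf.seek(0)
state = torch.load(buf)
print(state["w"].shape)  # torch.Size([4, 4])
```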
Documentation
- Split up documentation into subpages, greatly improving performance and searchability (#37419)
- Rename `torch._C.Generator` to `torch.Generator` (#38773)
- FAQ: add a note about recovering from OOM (#35214)
- `torch.histc`: add a note on elements outside of the given bounds (#34889)
- `functional.hardswish`, `functional.hardsigmoid`: improve docs (#35431)
- `Tensor.is_complex`: doc fix (#35680)
- `nn.KLDivLoss`: doc fix (#36137)
- `torch.min`, `torch.max`, `torch.median`: added a note on deterministic/non-deterministic gradients (#36481)
- Amp gradient accumulation example (#36601)
- `functional.softmax`: doc fix (#36600)
- Update `contribution_guide.rst` (#36438)
- Documentation for LU decomposition: deriving L, U, and P (#36907)
- CONTRIBUTING.md: fixed missing links (#37131)
- Add links to more subdirectory READMEs in CONTRIBUTING.md (#38049)
- `torch.isclose`: adds missing documentation (#37295)
- Improve checkpoint docs to warn users about detached gradient issues (#37266)
- Add documentation about multithreaded autograd (#37020)
- `tensor.view`: doc improved (#36728)
- `nn.MultiheadAttention`: fixed typo in documentation (#37496)
- `contribution_guide.rst` and `governance.rst`: fixed broken links (#37820)
- `nn.Hardsigmoid` and `nn.functional.hardsigmoid`: documentation added (#38120)
- `nn.FeatureAlphaDropout`: documentation added (#36295)
- `nn.functional.hardswish`: documentation added (#37989)
- `torch.bucketize`, `torch.searchsorted`: documentation added (#38119) (see the sketch after this list)
- Documented the bfloat16 dtype and BFloat16Tensor (#37051)
- `nn.Linear`: fix sample code (#38002)
- `torch.index_add`: add missing args for index_add (#38213)
- `Tensor.is_nonzero`: doc added (#37845)
- `functional.upsample`: correct upsample doc to match interpolation (#38455)
- `nn.CTCLoss`: added target un-pad example (#38393)
- `nn.SyncBatchNorm`: improved doc (#38423, #38890, #39646)
- `torch.utils.cmake_prefix_path`: documented (#38727)
- `torch.logcumsumexp`: fixed formula (#38952)
- Add C++ landing page (#38450)
- Use the version number instead of 'master' in the html header title (#38149)
- Docs fix: added missing indent (#35017)
- Fix many doc issues (#37099)
- docs: fixed docstring indentation for documentation (#37739)
- Remove an old section of the aten doc that is no longer true (#35807)
- Removed Python 2 references (#36336)
- Fix multiline signatures in docstrings (#38768)
- Removed Java documentation (#38920)
- Fix missing code in 'Installing C++ distribution of PyTorch' (#39237)
- `nn.MultiheadAttention`: update `key_padding_mask` arg docs (#39321)
- `torch.squeeze`, `torch.split`, `torch.set_printoptions`, `torch.save`: docs updated (#39303)
- `utils.cpp_extension`: correct some usages and docs (#39766)
- `nn.BatchNorm`, `nn.InstanceNorm`: clarify that the variance estimator is biased for normalization layers (#39752)
- Remove duplicated entries in `random.rst` (#39725)
- Fix `Tensor.tolist` signature in the docstring (#39732)
- `torch.save`: added note (#40394)
- Update docs feature classifications (#40539)
- Fix autograd doc subsubsection display issue (#40796)
- Add a specific list of supported types in autograd (#38325)
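A minimal sketch of the newly documented `torch.bucketize`/`torch.searchsorted`; the values are illustrative:

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([3, 6, 9])
print(torch.bucketize(values, boundaries))     # tensor([1, 3, 4])
print(torch.searchsorted(boundaries, values))  # tensor([1, 3, 4])
```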
C++ API
- Added C++ autograd APIs and C++ tensor indexing docs (#35777)
Distributed
- `torch.distributed.launch`: Include launcher script docs in the distributed doc page (#40963)
- `torch.nn.parallel.DistributedDataParallel`: Enhance DDP docstrings for DDP + RPC support (#39916)
Quantization
- Minor fix to quantized conv docstring (#35134)
- quant docs: clean up hardswish (#40323)
- quant docs: add and clean up hardsigmoid (#40340)
- quant docs: add and clean up hardtanh (#40341)
- quant docs: add and clean up LayerNorm (#40342)
- quant docs: add and clean up GroupNorm (#40343)
- quant docs: add and clean up InstanceNorm{n}d (#40345)
- quant docs: add and clean up BatchNorm{n}d (#40346)
- quant docs: add and clean up ELU (#40377)
- Docstring changes for dynamically quantized classes (#40931)
RPC
- `torch.distributed.rpc.RRef`: Updated RRef docs to indicate RPC retries (#36678)
- `torch.distributed.rpc`: Add pointer to the RPC parameter server tutorial (#37667)
- `torch.distributed.rpc`: Add TensorPipe RPC backend documents (#39467, #40222)
- `torch.distributed.rpc`: Fix `ProcessGroupRpcBackendOptions` doc (#39787)
- `torch.distributed.rpc`: Fix RPC reference in top-level index (#40077)
- `torch.futures`: Add torch.futures to the API docs (#40051) (see the sketch after this list)
- `torch.distributed.rpc`: Fix typos in RPC docs (#40219)
- `torch.futures`: Improve torch.futures docs (#40245)
- `torch.distributed.rpc`: Minor improvements for RPC documents (#40296, #40298, #40299, #40300, #40305, #35809)
- `torch.distributed.rpc.functions.async_execution`: Add a warning to mention that async_execution does not work with the autograd profiler (#40309)
- `torch.distributed.rpc.functions.async_execution`: Add examples and tests for combining static/class methods with async execution (#40619)
- `torch.distributed`: Add a link on the RPC doc page to the PT Distributed overview (#41108)
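A minimal sketch of the newly documented `torch.futures` API (#40051); the callback is illustrative:

```python
import torch

f = torch.futures.Future()
g = f.then(lambda fut: fut.wait() + 1)  # chained future runs once f completes
f.set_result(torch.tensor(41))
print(g.wait())  # tensor(42)
```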
TorchScript