Changelog History

v0.4.0 Changes
April 24, 2018: PyTorch 0.4.0 release notes
Table of Contents

- Major Core Changes
  - Tensor / Variable merged
  - Zero-dimensional Tensors
  - dtypes
  - migration guide
- New Features
  - Tensors
    - Full support for advanced indexing
    - Fast Fourier Transforms
  - Neural Networks
    - Trade-off memory for compute
    - bottleneck: a tool to identify hotspots in your code
  - torch.distributions
    - 24 basic probability distributions
    - Added cdf, variance, entropy, perplexity etc.
  - Distributed Training
    - Launcher utility for ease of use
    - NCCL2 backend
  - C++ Extensions
  - Windows Support
  - ONNX Improvements
    - RNN support
- Performance improvements
- Bug fixes
Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

Major Changes and Potentially Breaking Changes:

- `Tensors` and `Variables` have merged
- Some operations now return 0-dimensional (scalar) `Tensors`
- Deprecation of the `volatile` flag

Improvements:

- `dtypes`, `devices`, and NumPy-style `Tensor` creation functions added
- Support for writing device-agnostic code

We wrote a migration guide that should help you transition your code to the new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate. The contents of this section (Major Core changes) are included in the migration guide.
Merging `Tensor` and `Variable` classes

`torch.autograd.Variable` and `torch.Tensor` are now the same class. More precisely, `torch.Tensor` is capable of tracking history and behaves like the old `Variable`; `Variable` wrapping continues to work as before but returns an object of type `torch.Tensor`. This means that you don't need the `Variable` wrapper everywhere in your code anymore.

The `type()` of a `Tensor` has changed

Note also that the `type()` of a Tensor no longer reflects the data type. Use `isinstance()` or `x.type()` instead:

```python
>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x))  # was torch.DoubleTensor
<class 'torch.Tensor'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True
True
```
When does `autograd` start tracking history now?

`requires_grad`, the central flag for `autograd`, is now an attribute on `Tensor`s. Let's see how this change manifests in code.

`autograd` uses the same rules previously used for `Variable`s. It starts tracking history when any input `Tensor` of an operation has `requires_grad=True`. For example,

```python
>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False, so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>>
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has requires_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
```
Manipulating the `requires_grad` flag

Other than directly setting the attribute, you can change this flag in-place using `my_tensor.requires_grad_(requires_grad=True)`, or, as in the above example, at creation time by passing it in as an argument (default is `False`), e.g.,

```python
>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True
```
What about `.data`?

`.data` was the primary way to get the underlying `Tensor` from a `Variable`. After this merge, calling `y = x.data` still has similar semantics. So `y` will be a `Tensor` that shares the same data with `x`, is unrelated to the computation history of `x`, and has `requires_grad=False`.

However, `.data` can be unsafe in some cases. Any changes on `x.data` wouldn't be tracked by `autograd`, and the computed gradients would be incorrect if `x` is needed in a backward pass. A safer alternative is to use `x.detach()`, which also returns a `Tensor` that shares data with `requires_grad=False`, but will have its in-place changes reported by `autograd` if `x` is needed in backward.

Some operations now return 0-dimensional (scalar) Tensors
Previously, indexing into a `Tensor` vector (1-dimensional tensor) gave a Python number, but indexing into a `Variable` vector gave (inconsistently!) a vector of size `(1,)`! Similar behavior existed with reduction functions, i.e. `tensor.sum()` would return a Python number, but `variable.sum()` would return a vector of size `(1,)`.

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new `torch.tensor` function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of `numpy.array`). Now you can do things like:

```python
>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>>
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([2., 3., 4., 5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])
```
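Since `torch.tensor` is modeled on `numpy.array`, the same 0-dimensional semantics can be tried out directly in NumPy; a quick runnable comparison:

```python
import numpy as np

# a 0-dimensional array, analogous to a PyTorch scalar tensor
scalar = np.array(3.1416)
print(scalar.shape)       # () -- no dimensions
print(scalar.item())      # 3.1416 as a plain Python float

# a 1-element vector is a different object entirely
vector = np.array([3])
print(vector.shape)       # (1,)

# indexing into a vector yields a 0-dimensional scalar, with .item()
v = np.arange(2, 6)
print(v[3], v[3].item())  # 5 5
```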
Accumulating losses

Consider the widely used pattern `total_loss += loss.data[0]` before 0.4.0. `loss` was a `Variable` wrapping a tensor of size `(1,)`, but in 0.4.0 `loss` is now a scalar and has 0 dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use `loss.item()` to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
Deprecation of the `volatile` flag

The `volatile` flag is now deprecated and has no effect. Previously, any computation that involved a `Variable` with `volatile=True` wouldn't be tracked by `autograd`. This has now been replaced by a set of more flexible context managers including `torch.no_grad()`, `torch.set_grad_enabled(grad_mode)`, and others.

```python
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>>
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False
```
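The mechanics behind such a dual-use switch can be sketched in plain Python: a process-wide flag that is flipped immediately on construction (the plain-call form) and restored on context-manager exit. This is only an illustrative sketch with hypothetical helper names, not PyTorch's actual implementation (which lives in C++ and is thread-local):

```python
# process-wide flag standing in for PyTorch's grad mode
_grad_enabled = True

def is_grad_enabled():
    return _grad_enabled

class set_grad_enabled:
    def __init__(self, mode):
        global _grad_enabled
        self.previous = _grad_enabled
        _grad_enabled = mode           # takes effect immediately, like a plain call
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        global _grad_enabled
        _grad_enabled = self.previous  # restored even if the block raised

def no_grad():
    return set_grad_enabled(False)

with no_grad():
    assert not is_grad_enabled()
assert is_grad_enabled()               # restored after the with-block

set_grad_enabled(False)                # plain-call form: the flag stays off
assert not is_grad_enabled()
```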
dtypes, devices and NumPy-style creation functions

In previous versions of PyTorch, we used to specify the data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, `torch.cuda.sparse.DoubleTensor` was the `Tensor` type representing `double` data, living on CUDA devices, and with COO sparse tensor layout.

In this release, we introduce `torch.dtype`, `torch.device` and `torch.layout` classes to allow better management of these properties via NumPy-style creation functions.
torch.dtype

Below is a complete list of available `torch.dtype`s (data types) and their corresponding tensor types.

| Data type | torch.dtype | Tensor types |
| --- | --- | --- |
| 32-bit floating point | `torch.float32` or `torch.float` | `torch.*.FloatTensor` |
| 64-bit floating point | `torch.float64` or `torch.double` | `torch.*.DoubleTensor` |
| 16-bit floating point | `torch.float16` or `torch.half` | `torch.*.HalfTensor` |
| 8-bit integer (unsigned) | `torch.uint8` | `torch.*.ByteTensor` |
| 8-bit integer (signed) | `torch.int8` | `torch.*.CharTensor` |
| 16-bit integer (signed) | `torch.int16` or `torch.short` | `torch.*.ShortTensor` |
| 32-bit integer (signed) | `torch.int32` or `torch.int` | `torch.*.IntTensor` |
| 64-bit integer (signed) | `torch.int64` or `torch.long` | `torch.*.LongTensor` |

Use `torch.set_default_dtype` and `torch.get_default_dtype` to manipulate the default `dtype` for floating point tensors.
torch.device

A `torch.device` contains a device type (`'cpu'` or `'cuda'`) and an optional device ordinal (id) for the device type. It can be initialized with `torch.device('{device_type}')` or `torch.device('{device_type}:{device_ordinal}')`.

If the device ordinal is not present, this represents the current device for the device type; e.g., `torch.device('cuda')` is equivalent to `torch.device('cuda:X')` where `X` is the result of `torch.cuda.current_device()`.
torch.layout

`torch.layout` represents the data layout of a `Tensor`. Currently `torch.strided` (dense tensors) and `torch.sparse_coo` (sparse tensors with COO format) are supported.

Creating Tensors

Methods that create a `Tensor` now also take in `dtype`, `device`, `layout`, and `requires_grad` options to specify the desired attributes of the returned `Tensor`. For example,

```python
>>> device = torch.device("cuda:1")
>>> x = torch.randn(3, 3, dtype=torch.float64, device=device)
tensor([[0.6344, 0.8562, 1.2758],
        [0.8414, 1.7962, 1.0589],
        [0.1369, 1.0462, 0.4373]], dtype=torch.float64, device='cuda:1')
>>> x.requires_grad  # default is False
False
>>> x = torch.zeros(3, requires_grad=True)
>>> x.requires_grad
True
```
torch.tensor

`torch.tensor` is one of the newly added tensor creation methods. It takes in array-like data of all kinds and copies the contained values into a new `Tensor`. As mentioned earlier, `torch.tensor` is the PyTorch equivalent of NumPy's `numpy.array` constructor. Unlike the `torch.*Tensor` methods, you can also create zero-dimensional `Tensor`s (aka scalars) this way (a single Python number is treated as a size in the `torch.*Tensor` methods). Moreover, if a `dtype` argument isn't given, it will infer the suitable `dtype` given the data. It is the recommended way to create a tensor from existing data like a Python list. For example,

```python
>>> cuda = torch.device("cuda")
>>> torch.tensor([[1], [2], [3]], dtype=torch.half, device=cuda)
tensor([[1],
        [2],
        [3]], device='cuda:0')
>>> torch.tensor(1)               # scalar
tensor(1)
>>> torch.tensor([1, 2.3]).dtype  # type inference
torch.float32
>>> torch.tensor([1, 2]).dtype    # type inference
torch.int64
```
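Because `torch.tensor` mirrors `numpy.array`, the same construction and inference behavior can be previewed in NumPy. One deliberate difference worth noting: NumPy infers `float64` where `torch.tensor` infers `float32`.

```python
import numpy as np

# array-like data of all kinds is copied into a new array
a = np.array([[1], [2], [3]])
print(a.shape)                   # (3, 1)

# a single number yields a zero-dimensional array (a scalar)
print(np.array(1).shape)         # ()

# dtype is inferred from the data when not given explicitly
print(np.array([1, 2.3]).dtype)  # float64 (torch.tensor would infer float32)
print(np.array([1, 2]).dtype)    # platform default integer (int64 on most 64-bit systems)

# and can be forced, as with torch.tensor's dtype argument
print(np.array([1, 2], dtype=np.float16).dtype)  # float16
```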
We've also added more tensor creation methods. Some of them have `torch.*_like` and/or `tensor.new_*` variants.

`torch.*_like` takes in an input `Tensor` instead of a shape. It returns a `Tensor` with the same attributes as the input `Tensor` by default unless otherwise specified:

```python
>>> x = torch.randn(3, dtype=torch.float64)
>>> torch.zeros_like(x)
tensor([0., 0., 0.], dtype=torch.float64)
>>> torch.zeros_like(x, dtype=torch.int)
tensor([0, 0, 0], dtype=torch.int32)
```

`tensor.new_*` can also create `Tensor`s with the same attributes as `tensor`, but it always takes in a shape argument:

```python
>>> x = torch.randn(3, dtype=torch.float64)
>>> x.new_ones(2)
tensor([1., 1.], dtype=torch.float64)
>>> x.new_ones(4, dtype=torch.int)
tensor([1, 1, 1, 1], dtype=torch.int32)
```
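The `*_like` pattern comes straight from NumPy, where `zeros_like`, `ones_like` and `full_like` behave the same way; a quick NumPy illustration of attribute inheritance and overriding:

```python
import numpy as np

x = np.random.randn(3)  # float64 array

# *_like takes an existing array and copies its shape and dtype...
z = np.zeros_like(x)
print(z, z.dtype)            # [0. 0. 0.] float64

# ...unless an attribute is overridden explicitly
zi = np.zeros_like(x, dtype=np.int32)
print(zi, zi.dtype)          # [0 0 0] int32

# full_like fills with an arbitrary value
print(np.full_like(x, 7.0))  # [7. 7. 7.]
```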
To specify the desired shape, you can either use a tuple (e.g., `torch.zeros((2, 3))`) or variable arguments (e.g., `torch.zeros(2, 3)`) in most cases.

| Name | Returned Tensor | torch.*_like variant | tensor.new_* variant |
| --- | --- | --- | --- |
| `torch.empty` | uninitialized memory | ✔ | ✔ |
| `torch.zeros` | all zeros | ✔ | ✔ |
| `torch.ones` | all ones | ✔ | ✔ |
| `torch.full` | filled with a given value | ✔ | ✔ |
| `torch.rand` | i.i.d. continuous Uniform[0, 1) | ✔ | |
| `torch.randn` | i.i.d. Normal(0, 1) | ✔ | |
| `torch.randint` | i.i.d. discrete Uniform in given range | ✔ | |
| `torch.randperm` | random permutation of {0, 1, ..., n - 1} | | |
| `torch.tensor` | copied from existing data (`list`, NumPy `ndarray`, etc.) | | |
| `torch.from_numpy`* | from NumPy `ndarray` (sharing storage without copying) | | |
| `torch.arange`, `torch.range`, and `torch.linspace` | uniformly spaced values in a given range | | |
| `torch.logspace` | logarithmically spaced values in a given range | | |
| `torch.eye` | identity matrix | | |

*: `torch.from_numpy` only takes in a NumPy `ndarray` as its input argument.

Writing device-agnostic code
Previous versions of PyTorch made it difficult to write code that was device-agnostic (i.e., code that could run on both CUDA-enabled and CPU-only machines without modification).

PyTorch 0.4.0 makes this easier in two ways:

- The `device` attribute of a Tensor gives the `torch.device` for all Tensors (`get_device` only works for CUDA tensors)
- The `to` method of `Tensors` and `Modules` can be used to easily move objects to different devices (instead of having to call `cpu()` or `cuda()` based on the context)

We recommend the following pattern:

```python
# at the beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```
Tensors

Full support for advanced indexing

PyTorch now has full support for advanced indexing, following NumPy's advanced indexing rules. The following examples are now possible:

```python
a = torch.rand(10, 10, 10, 10)

# the indexing elements can have other shapes than 1
b = a[[[3, 2]], :, [[1, 3]]]

# broadcasting also supported in the indices, as well as lists,
# negative indices, slices, ellipses, numbers
c = a[[1, 2], 2:4, :, [1]]

# can also support tensors as indices
index = torch.tensor([2, 4])
d = a[index]

# and the indices can be on the GPU or CPU
e = a[index.cuda()]
f = a.cuda()[index]

mask = torch.rand(10) > 0.5

# we can now index with a mask that has fewer
# dimensions than the indexing tensor
c = a[mask, :5]
```
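Since these are NumPy's own advanced indexing rules, the resulting shapes can be explored with NumPy arrays; a small runnable sketch of the patterns above:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((10, 10, 10, 10))

# index arrays can have shapes other than 1, and they broadcast;
# with a slice between them, the broadcast dims move to the front
b = a[[[3, 2]], :, [[1, 3]]]
print(b.shape)             # (1, 2, 10, 10)

# mixing index lists with slices
c = a[[1, 2], 2:4, :, [1]]
print(c.shape)             # (2, 2, 10)

# an array as an index selects along the first dimension
index = np.array([2, 4])
print(a[index].shape)      # (2, 10, 10, 10)

# a boolean mask with fewer dimensions than the array
mask = rng.random(10) > 0.5
print(a[mask, :5].shape)   # (mask.sum(), 5, 10, 10)
```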
Fast Fourier Transform

- Add new FFT methods #5856
- Add `torch.stft` (short-time Fourier transform) and hann/hamming/bartlett window functions #4095
- Support arbitrary number of batch dimensions in *FFT #6528
New and updated Torch operators

- Added `torch.log2` and `torch.log10` #6272
- Added `torch.isnan` #5273
- Add `torch.reshape`, which is similar to `numpy.reshape`. It is roughly equivalent to `tensor.contiguous().view()`, but avoids copying in certain cases #5575
- Add CPU implementation of `torch.unique`, which outputs the unique elements of a Tensor #5503
- Add `torch.det`, `torch.logdet` and `torch.slogdet`, for computing the (log-)determinant of square 2D tensors. For negative determinants, `torch.logdet` returns `nan`, while `torch.slogdet` returns the sign of the log-determinant and the log of the absolute value of the determinant. #3816 and #5393
- Add `nn.functional.gumbel_softmax`, which lets you use the reparametrization trick for discrete variables #3341
- Add `torch.take` and `Tensor.put_`. Those functions are equivalent to `numpy.take` and `numpy.put`, and are the base for full support of advanced indexing in PyTorch #3263
- Add `torch.randint`, similar to `numpy.random.randint` #6136
- Add `torch.diagonal` and `torch.diagflat`, similar to `numpy.diagonal` and `numpy.diagflat`. They are meant as a replacement for `torch.diag`, which handled both the case of constructing a diagonal tensor and the case of extracting the diagonal of a matrix #5622
- Add `torch.einsum`, equivalent to `numpy.einsum`. einsum allows you to perform operations using Einstein's notation. #5503

  ```python
  a = torch.arange(0, 9).reshape(3, 3)
  # the following transposes a
  b = torch.einsum('ij->ji', (a,))
  ```

- Add `torch.expm1`, a numerically stable `exp(x)-1` for small `x` #4350
- Allow users to specify individual split sizes with `torch.split` #3837
- Add `torch.where(condition, tensor1, tensor2)` that returns a tensor of elements selected from `tensor1` or `tensor2` based on `condition` #4259
- Add `Tensor.norm(dim)` for sparse tensors #4882
- Implement `torch.neg` for all types #4075
- Implement gradient calculation for `torch.trtrs` #3972
- Deprecate out-of-place `Tensor.resize` and `Tensor.resize_as`. These have weird semantics and are hard to use correctly. Please use their in-place variants `Tensor.resize_` and `Tensor.resize_as_` #4886
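Several of these operators are explicit NumPy counterparts (`einsum`, `where`, `take`, `diagonal`), so their behavior can be previewed in NumPy; a brief runnable sketch:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

# einsum with Einstein notation: 'ij->ji' transposes a
b = np.einsum('ij->ji', a)
print(b[0])                   # [0 3 6]

# where(condition, x, y) selects elementwise
c = np.where(a % 2 == 0, a, -a)
print(c[0])                   # [ 0 -1  2]

# take / diagonal, mirrored by torch.take and torch.diagonal
print(np.take(a, [0, 4, 8]))  # [0 4 8] (flat indexing)
print(np.diagonal(a))         # [0 4 8]
```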
Rename `async` argument in `.cuda()` to `non_blocking`

The `async` keyword argument in conversion calls is now deprecated in PyTorch, and has been replaced by `non_blocking`. This was necessary because `async` will be a keyword in Python 3.7.

Neural Networks
A new autograd container that lets you trade compute for memory

The new `checkpoint` container allows you to store only a subset of the outputs necessary for backpropagation. If an output is missing (to save memory), the `checkpoint` container will recompute the intermediate outputs from the closest checkpoint, so that memory usage can be reduced (with an increase in computation time).

Here is an example:

```python
# input
input = torch.rand(1, 10)

# suppose we have a very deep model
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)
output = model(input)
```

The above model uses a lot of memory, because it needs to keep the intermediate values of every operation for backpropagation. `checkpoint` lets you reduce the memory requirements:

```python
# create the input tensor and set requires_grad=True
# NOTE: the requires_grad=True for the input is a current
# limitation of checkpointing. At least one of the
# model inputs should have requires_grad=True.
# If you don't do it, you might have empty gradients.
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]

# define functions that decide where we will checkpoint and
# store intermediate gradients. In this case, we will only
# store one intermediate gradient, in the middle of the model
def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# now use the new checkpoint functionality
from torch.utils.checkpoint import checkpoint

x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)

# the last output needs to be run without checkpointing
x = layers[-1](x)
x.sum().backward()  # works!
```

For sequential modules (which can have arbitrary blocks inside), a helper function `checkpoint_sequential` is provided, which takes care of the most common use cases:

```python
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

from torch.utils.checkpoint import checkpoint_sequential

# split in two blocks
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()  # works!
```
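The trade-off itself is independent of autograd: store only every k-th intermediate value, and recompute the rest from the nearest stored one when needed. A framework-free sketch of that idea, with hypothetical helper names, in plain Python:

```python
def forward_with_checkpoints(x, steps, k):
    """Apply each function in `steps`, storing only every k-th intermediate."""
    saved = {0: x}
    for i, f in enumerate(steps):
        x = f(x)
        if (i + 1) % k == 0:
            saved[i + 1] = x
    return x, saved

def recover(i, steps, saved):
    """Recompute the value at position i (after the first i steps)."""
    j = max(n for n in saved if n <= i)  # nearest stored checkpoint
    x = saved[j]
    for f in steps[j:i]:                 # recompute the gap (extra compute)
        x = f(x)
    return x

steps = [lambda v, n=n: v + n for n in range(8)]  # 8 toy "layers"
out, saved = forward_with_checkpoints(0, steps, k=4)
print(out)                        # 0+0+1+...+7 = 28
print(sorted(saved))              # only positions [0, 4, 8] are stored
print(recover(3, steps, saved))   # 0+0+1+2 = 3, recomputed from checkpoint 0
```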
bottleneck: a tool to identify hotspots in your code

`torch.utils.bottleneck` (#5216, #6425) is a tool that can be used as an initial step for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch's autograd profiler. See the bottleneck docs for more details.

reduce=False losses

As of this release, all of our loss functions support the `reduce` keyword. Specifying `reduce=False` gives a Tensor per unit of loss instead of a single reduced loss. #4924, #5346, #5646, #4231, #4705, #5680
New modules and module improvements

- Add `DistributedDataParallelCPU`. This is similar to `DistributedDataParallel`, but with specific support for models running on the CPU (contrary to `DistributedDataParallel`, which targets GPU), and supports `mpi`, `gloo` and `tcp` backends #5919
- Add Group Normalization (`nn.GroupNorm`), an alternative to batch normalization that doesn't suffer from the same issues as `BatchNorm` for small batch sizes
- Add Layer Normalization (`nn.LayerNorm`), an alternative to batch normalization often used in NLP tasks #4922
- Add Local Response Normalization (`nn.LocalResponseNorm`) #4922
- `MaxPool3d` now supports double backwards. MaxPool3d and MaxUnpool3d now use indices consistent with the rest of the pooling layers #5328
- All loss functions now support a reduce argument to return a batch of losses #264
- Add util to clip gradient value in torch.nn.utils.clip_grad and add param to He initialization scheme in `torch.nn.init` #6173
- Renamed `torch.nn.init.*` methods to have an underscore at the end, as they operate in-place, and deprecated the old versions #6093
- Added support for returning dictionaries in `DataParallel` #6113
- Added support for N-D tensors in `torch.nn.Bilinear` #5764
- Add `Embedding.from_pretrained` factory. This allows initializing an Embedding layer with an existing tensor, bypassing the initial random initialization of its weights.
- You can now slice `nn.Sequential`, `nn.ModuleList`, and `nn.ParameterList` #4491
- Registered `nn.Module` integer parameters and buffers are now immune to `module.float()`, `module.double()` and `module.half()` calls #3820
torch.distributions

`torch.distributions` has expanded to include 24 basic probability distributions: `Bernoulli`, `Beta`, `Binomial`, `Categorical`, `Cauchy`, `Chi2`, `Dirichlet`, `Exponential`, `FisherSnedecor`, `Gamma`, `Geometric`, `Gumbel`, `Laplace`, `LogNormal`, `Multinomial`, `MultivariateNormal`, `Normal`, `OneHotCategorical`, `Pareto`, `Poisson`, `RelaxedBernoulli`, `RelaxedOneHotCategorical`, `StudentT`, and `Uniform`.

The `Distribution` interface has expanded to include many methods including `.cdf()`, `.icdf()`, `.mean()`, `.variance()`, `.entropy()`, and `.perplexity()`. Distributions now split tensor dimensions into `sample_shape` + `batch_shape` + `event_shape`. Most continuous distributions now also implement a differentiable `.rsample()` method to compute pathwise derivatives, aka the reparameterization trick (check `.has_rsample` for availability):

```python
>>> loc = torch.tensor(0., requires_grad=True)
>>> scale = torch.tensor(1., requires_grad=True)
>>> samples = Normal(loc, scale).rsample(sample_shape=(1000,))
>>> loss = (samples - 0.5).pow(4).mean()  # average over 1000 monte carlo samples
>>> grad(loss, [loc, scale])
(tensor(-7.5092), tensor(15.2704))
```
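The pathwise-derivative idea can be checked numerically without autograd: write the sample as `loc + scale * eps` with `eps ~ N(0, 1)`, and differentiate the loss through that expression. A NumPy sketch, using a central finite difference over the same fixed noise to confirm the analytic pathwise gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)  # fixed noise: the "reparameterization"

def loss(loc, scale):
    samples = loc + scale * eps     # pathwise: samples are differentiable in loc/scale
    return np.mean((samples - 0.5) ** 4)

# analytic pathwise gradients at loc=0, scale=1, averaged over the same noise
samples = 0.0 + 1.0 * eps
g = 4 * (samples - 0.5) ** 3
d_loc = np.mean(g)                  # d samples / d loc   = 1
d_scale = np.mean(g * eps)          # d samples / d scale = eps

# finite-difference check (same fixed eps, so the comparison is tight)
h = 1e-4
fd_loc = (loss(h, 1.0) - loss(-h, 1.0)) / (2 * h)
fd_scale = (loss(0.0, 1.0 + h) - loss(0.0, 1.0 - h)) / (2 * h)

print(d_loc, fd_loc)      # both near the exact value E[4(x-0.5)^3] = -6.5
print(d_scale, fd_scale)  # both near the exact value 15
```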
Most discrete distributions implement an `.enumerate_support()` method to make it easy to sum over all possible sample values (check `.has_enumerate_support` for availability).

`kl_divergence` is defined for many pairs of distributions, e.g.

```python
>>> x = torch.tensor(1.0, requires_grad=True)
>>> kl = kl_divergence(Uniform(-x, x), Normal(0., 1.))
>>> grad(kl, [x])[0]
tensor(-0.6667)
```
Distribution Transforms

New distributions can be created by combining `TransformedDistribution` with any number of `Transform` objects from the `torch.distributions.transforms` library, including: `ExpTransform`, `PowerTransform`, `SigmoidTransform`, `AbsTransform`, `AffineTransform`, `SoftmaxTransform`, `StickBreakingTransform`, `LowerCholeskyTransform`, and their inverses via the `.inv` property.

Distribution Constraints
Distributions provide metadata about the constraints of their `.support` and about their arguments (`.arg_constraints`). These `Constraint` objects are registered with transforms using `transform_to()` and `biject_to()`. Together, constraints and transforms make it easy to specify new distributions in a generic way:

```python
>>> scale = torch.tensor(1., requires_grad=True)
>>> p = Normal(0., scale)
>>> assert p.arg_constraints['scale'] == constraints.positive
>>> prior = TransformedDistribution(Normal(0., 1.),
...                                 transform_to(constraints.positive))
```

Constraints in the `torch.distributions.constraints` library include: `boolean`, `greater_than(lower_bound)`, `integer_interval(lower_bound, upper_bound)`, `interval(lower_bound, upper_bound)`, `lower_cholesky`, `lower_triangular`, `nonnegative_integer`, `positive`, `positive_definite`, `positive_integer`, `real`, `real_vector`, `simplex`, and `unit_interval`.

Distributed
Helper utility for launching distributed training jobs

We have added a utility function to help launch jobs on a distributed setup. In order to launch a script that leverages `DistributedDataParallel` on either a single node or multiple nodes, we can make use of `torch.distributed.launch` as follows:

```shell
python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3
```

The script simplifies day-to-day usability of the `distributed` package. You can read about its usage here: http://pytorch.org/docs/stable/distributed.html#launch-utility
A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed. It also provides new APIs for collective operations on multiple GPUs. You can enable the new backend via:

```python
torch.distributed.init_process_group("nccl")
```
Other distributed improvements

- Coalesce many small broadcasts to improve performance #4978
- Add mixed-precision support for distributed training #4891
- Release NCCL distributed backend. Previously it was marked as experimental. #4921
- Enable Infiniband support for Gloo data channel with automatic IB device detection #4795
C++ extensions

Previously, the official way of writing extensions using C or CUDA for custom modules was through the cffi extension. The drawback of this method was that it required a separate step for compiling the CUDA kernels, which could be a bit messy.

PyTorch now provides a better system for writing your own C++ / CUDA extensions. Example implementations using this new extension support can be found in the pytorch/cpp_extensions repo.

We provide two compilation modes:

- ahead-of-time compilation: you write a `setup.py` script using the new `CppExtension` or `CUDAExtension`, which is an extension of the `setuptools.Extension` module;
- just-in-time compilation: you pass the list of C++ / CUDA files that you want to compile to `torch.utils.cpp_extension.load`, and it will compile on the fly and cache the libraries for you.

Here is an example illustrating how easy it is to implement an extension.

In C++:

```cpp
// my_implementation.cpp
#include <torch/torch.h>
#include <unordered_set>

// can use templates as well, but let's keep it simple
using scalar_t = float;

at::Tensor unique_float(at::Tensor input_) {
  // only works for floats
  AT_ASSERT(input_.type().scalarType() == at::ScalarType::Float,
            "input must be a float tensor");
  // and CPU tensors
  AT_ASSERT(!input_.type().is_cuda(), "input must be a CPU tensor");

  // make the input contiguous, to simplify the implementation
  at::Tensor input = input_.contiguous();

  // get the pointer that holds the data
  scalar_t* input_data = input.data<scalar_t>();

  // let's use a function from the std library to implement
  // the unique function
  std::unordered_set<scalar_t> set(input_data, input_data + input.numel());

  // create the output tensor, with size set.size()
  at::Tensor output = input.type().tensor({static_cast<int64_t>(set.size())});
  scalar_t* output_data = output.data<scalar_t>();

  // copy the content of the set to the output tensor
  std::copy(set.begin(), set.end(), output_data);

  return output;
}

// this defines the functions exposed to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("unique_float", &unique_float, "Unique for float tensors");
}
```

And then in Python:

```python
import torch
from torch.utils.cpp_extension import load as load_ext

# pass the source files; they will be compiled on the fly
# and will return a python module
_C = load_ext('my_unique_lib', sources=['my_implementation.cpp'])

# now we can use the functions implemented in C++
unique = _C.unique_float
a = torch.tensor([1.0, 2.0, 1.0])
print(unique(a))
# tensor([2., 1.])
```
Windows support

PyTorch now officially supports Windows. We provide pre-compiled Conda binaries and pip wheels for Python 3.5 and 3.6. PyTorch on Windows doesn't support `distributed` training and might be a tad slower than Linux / OSX because Visual Studio supports an older version of OpenMP.

As always, you can use the commands at http://pytorch.org to install PyTorch on Windows. We have an FAQ that answers most questions you might have around Windows here: http://pytorch.org/docs/stable/notes/windows.html

ONNX Improvements
New ONNX operators

- Support export of `torch.max(input, dim)` and `torch.min(input, dim)` #6220
- Add symbolic for `ReLU` to support exporting to ONNX #5759
- Add `sum`, `prod`, `sqrt` and improve `log_softmax` #4579
- Add ONNX support for `InstanceNorm` #4626
- Add ONNX symbolic for `Elu` #3453
- Add ONNX symbolic for `UpsamplingNearest2d` #3450
Improvements

- Print source location when ONNX export fails for a node #5652
- Export onnx protobuf bindings to python #6651
- Support `output_padding` in `ConvTranspose` #4583
Better RNN support

PyTorch can now export a subset of RNNs to ONNX #4409

- Add Elman RNN export to ONNX #4613
- Support batch-first in ONNX export of padded sequences #5360
- Bidirectional Elman RNN export to ONNX #5120
- Handle sequence lengths correctly when exporting RNNs to ONNX #4695
- Support GRU export to ONNX #4390
Bugfixes

- Fix a bug in ONNX symbolic of 3d average pooling #6101
- Fix ONNX export of replication/reflection pad #4263
Miscellaneous improvements

- Implement `__dir__` for Tensors, so that editors can automatically autocomplete and query for the possible fields of Tensors
- Add `numpy()` and `from_numpy()` to `HalfTensor`
- Enable `TensorDataset` to have any number of input tensors
- Add `padding_value` to `torch.nn.utils.rnn.pad_sequence`
- Add `total_length` option to `pack_padded_sequence`, which is useful when using `DataParallel`, as we can ensure that we have sequences of the same length
- Improve numerical precision of `torch.arange`, making it consistent with `numpy.arange`
- `torch.load()` and `torch.save()` support arbitrary file-like objects
- `torch.nn.functional.grid_sample` now supports 2D (spatial) and 3D (volumetric) inputs
- Set python random seed in `DataLoader` workers, in order to improve experiment reproducibility
- Add `__delitem__` to `nn.Sequential`. Now one can delete arbitrary elements of a `nn.Sequential`. For example:

  ```python
  model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
  del model[1]  # deletes nn.ReLU
  ```

- `ReduceLROnPlateau` is now serializable #5300
- Add option to flush denormal numbers on CPU #5294
- PyTorch now exposes the gradients of conv1d, conv2d and conv3d with respect to the input and the weights #5408
- Add support for calling `pack_padded_sequence` with either a list or a Tensor #5133
- Support negative indexing for `padding_idx` in `nn.Embedding` #4496
- Implement backward pass for `pack_padded_sequence` #4512
- Add `nn.utils.rnn.pad_sequence` and `nn.utils.rnn.pack_sequence` to pad lists of variable-length Tensors with `0` and to pack a list of variable-length Tensors
- Add `torch.cuda.memory_cached`, `torch.cuda.max_memory_cached`, `torch.cuda.memory_allocated`, and `torch.cuda.max_memory_allocated` methods for checking CUDA memory usage #4511
- Allow viewing on non-contiguous tensors if the new view size is compatible with the tensor's original size and stride #4062
- `NLLLoss` and `CrossEntropyLoss` now support more than 2 dimensions #4654
- Add an option to not show `model_zoo` download progress bar #4135
- You can now assign modules to indices of `nn.Sequential` #4931
- You can create tensors with a numpy `np.longlong` array #4367
- Change the autograd execution order to use good heuristics. This greatly improves memory usage for large models #4746
- Add AMSgrad mode to `Adam` and `SparseAdam` optimizers #4034
- Better `torch.autograd.profiler` support for CUDA profiling using the `cudaEvent` API #3734
- `torch.set_num_threads` also sets the respective MKL option, so you won't need to use an environment variable to control it #4949

Performance improvements
- Speed up CPU `nn.EmbeddingBag`, making training overall 30% faster #5433
- π Move `nn.MarginRankingLoss`, `nn.CosineEmbeddingLoss`, `nn.HingeEmbeddingLoss`, and `nn.TripletMarginLoss` from Python to our ATen backend, resulting in up to a 3x performance gain in some cases. #5346, #5646, #5080, #5680
- Implement `pin_memory()` as a NativeFunction #4094
- πΎ Save `self.numel()` for the backward computation instead of `self`, to save memory #5747
- π Rearrange dimensions for pointwise operations for up to 10x better performance in one case. #4174
- Vectorize `normal_` for a 5-6x speed-up in a small case #4312
- π Allow usage of GPUDirect within PyTorch for the Broadcast operation #4183
- Speed up `nn.Linear` for the 3D input case #5279
- Speed up `Conv3D` on the CPU by parallelizing `vol2col` and `col2vol` #4824
- β Add an AVX2 implementation for the sigmoid function, showing around a 10x speed-up #5010
- π Use a fast integer division algorithm to avoid division ops inside kernels. #5054
- π Improve occupancy for CUDA random number generation #5710
- β Add optimization to norm for common norms #5722
- β Add a fast fused GLU backward #5782
- β‘οΈ Optimize unique sorting by using `std::vector` + `sort` instead of `std::set`, giving up to a 5x speed-up. #5913
- Speed up sum over a dimension #6026
- Enable MKLDNN convolution forward and backward. #6062
- Parallelize non-contiguous pointwise operations with OpenMP #2764
- β Add cuDNN Tensor Core ops to RNNs for Volta #3409
- π² Vectorize `exp`, `log`, `sin`, `cos` #6078
- Reuse intermediate results over multiple backward grad_inputs #3526
Distributed
- π DistributedDataParallel: 10% NCCL backend perf improvements with mixed-precision support #5064
- π Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance #4870
π Bug fixes
torch operators
- π Improve `torch.digamma` precision near poles #6517
- π Fix incorrect behavior of `Tensor.random_` on negative inputs #6463
- π Fix undefined behavior in the backward pass for `tensor.permute(dims)` with negative dims #5945
- π Fix integer overflow in the `torch.remainder` operator (it would break with a divisor above `2**48`) #5906
- π Fix memory leak in `torch.bmm` #5744
- β Make the dimension checker of `scatter_add_` consistent with `scatter_`'s #5659
- π Fix CPU `torch.multinomial` with a non-contiguous probability tensor input (previously, it would overwrite input data) #5093
- π Fix CUDA `torch.multinomial` using incorrect strides and being able to select zero-probability events. #5774, #5238
- π Support an empty index tensor for `index_select` #3429
- π Support an empty indices tensor in CUDA `Tensor.put_` #4486
- π Improve stability of `torch.cat` with empty tensors #3602, #5971, #5819
- π Fix `torch.fft` in the case where any of the input dimensions is not aligned #6118
- π Improve the CUDA btrifact error message #5644
- Return zeros for the eigenvector tensor when it is not requested in `torch.symeig` #3411
- π Fix `torch.btrifact` on tensors. #4318
- π Fix `torch.pstrf` on tensors. #4883
- π Fix memory leak in `torch.median` #6889
- π Fix SVD backward on non-square matrices when `some=False` #6870
core
- Detect re-initialization of the `_C` shared library that would often result in segfaults on exit #6232
- π Fix indexing with all-zero ByteTensors #3926
- 0οΈβ£ Only allow dense floating-point types as the default tensor type. #5674
- π Initialize CUDA before setting CUDA tensor types as default, to prevent a crash #4788
- π Fix a bug where `from_dlpack` fails if CUDA is not initialized. #4182
- π Fix a crash when creating a CUDA tensor from a numpy array #5850
- π Fix broken sharing of an empty tensor in multiprocessing on some OSes #6229
autograd
- βͺ Restore `allow_unused` functionality: throw an error when a differentiated input is unused or unreachable. #6553
- Fix `output_nr` not being incremented correctly. This caused crashes in the backward pass of operations that don't `requires_grad` on some inputs. #4812
- π Fix nvprof parsing in the `torch.autograd.profiler` #5840
nn layers
- π Support specifying the size in only certain dimensions for adaptive pooling #3127
- π Fix reflection padding boundary checks to not cause invalid memory access #6438
- π Improve error messages for `NLLLoss`. #5299, #6072
- π Fix `kl_div` backward on CUDA. Previously it would not respect `gradOutput` when computing `gradInput`. #5814
- π Fix an incorrect `bias` size assert for `Linear` #5992
- π Fix an incorrect `nn.functional.convNd` and `nn.functional.conv_transposeNd` error message #5701
- Check that the shapes of input and target match, instead of the number of elements, for some loss functions #5085
- π Fix `torch.diag` backward returning a square grad for non-square input #4538
- π Fix the convolution type-mismatch error message #5815
- β Add an `align_corners` option to linearly interpolating upsampling, and make the default upsampling behavior more consistent with other frameworks #5927
- Prevent numerical issues with `poisson_nll_loss` when `log_input=False` #3336
CUDA
- Ensure convolution weights are contiguous to fix CUDA `ConvTranspose` double backward #4543
- π Fix CUDA double backwards #4460
π sparse
- π Fix embedding with `sparse=True` #4686
- π Fix sparse embedding backward when the input contains only `padding_idx` #6211
- π Handle copying empty sparse tensors to/from CPU and GPU. #5361
dataloader
- β Add argument checks to the `torch.utils.data.Sampler` classes, fixing a bug where `DataLoader` tries to load the entire dataset on a non-integer `batch_size`. #6249
- Set `dataloader.batch_size = None` when a `batch_sampler` is given, fixing a bug where `DataLoader` would report `batch_size` as `1`. #6108
- π Improve signal handling in `DataLoader` #4643
- Ignore `FileNotFoundError` when shutting down #5380
- π Make preprocessing deterministic #4640
optim
- β‘οΈ Cast tensors when loading optimizer state dicts, to improve usability #3658
- List model parameters in deterministic order to improve stability of `load_state_dict()` #6031
- β Add parameter range checks for all optimizers #6000
- π Fix `AMSGrad` mode for `SparseAdam` #4314
distributed and multigpu

v0.3.1 Changes
February 14, 2018

Binaries
- β Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support them going forward is removed)
- π Stop binary releases for CUDA 7.5
- β Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.

As always, links to our binaries are on http://pytorch.org
π New features
- β Add a Cosine Annealing learning rate scheduler #3311
- β Add a `reduce` argument to `PoissonNLLLoss` to be able to compute unreduced losses #3770
- Allow `target.requires_grad=True` in `l1_loss` and `mse_loss` (compute the loss w.r.t. `target`) #3876
- β Add `random_split`, which randomly splits a dataset into non-overlapping new datasets of given lengths #4435
- π Introduced scopes to annotate ONNX graphs for better TensorBoard visualization of models #5153
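A quick sketch of `random_split`; any sequence with `__len__` and `__getitem__` serves as the dataset here:

```python
import torch
from torch.utils.data import random_split

# Split a 10-element dataset into non-overlapping subsets of 7 and 3 samples.
dataset = list(range(10))
train_set, val_set = random_split(dataset, [7, 3])
print(len(train_set), len(val_set))  # 7 3
```

The two subsets together cover every sample exactly once, in a random assignment.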
- Allow `map_location` in `torch.load` to be a string, such as `map_location='cpu'` or `map_location='cuda:2'` #4203
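For example, to load a checkpoint onto the CPU regardless of where it was saved (a sketch; the file path is illustrative):

```python
import os
import tempfile

import torch

# Save a small checkpoint, then load it back mapping all storages to the CPU.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"weights": torch.ones(3)}, path)

state = torch.load(path, map_location='cpu')
print(state["weights"].device)  # cpu
```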
π Bug Fixes
Data Loader / Datasets / Multiprocessing
- π· Made DataLoader workers more verbose on bus error and segfault. Additionally, add a `timeout` option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
- π DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the `fork` syscall. Now, each worker will have its RNG seed set to `base_seed + worker_id`, where `base_seed` is a random int64 value generated by the parent process. You may use `torch.initial_seed()` to access this value in `worker_init_fn`, which can be used to set other seeds (e.g. NumPy) before data loading. `worker_init_fn` is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading #4018
- β Add additional signal handling in DataLoader worker processes for when workers die abruptly.
- π· A negative value for `num_workers` now gives a ValueError #4019
- π Fixed a typo in the `ConcatDataset.cumulative_sizes` attribute name #3534
- 0οΈβ£ Accept longs in `default_collate` for the dataloader in Python 2 #4001
- Re-initialize the autograd engine in child processes #4158
- π Fix the distributed dataloader so it pins memory to the current GPU, not GPU 0. #4196
CUDA / CuDNN
- π Allow cuDNN for fp16 batch norm #4021
- π Use the `enabled` argument in `torch.autograd.profiler.emit_nvtx` (it was being ignored) #4032
- π Fix cuBLAS arguments for fp16 `torch.dot` #3660
- Fix the CUDA `index_fill_` boundary check with small tensor sizes #3953
- π Fix CUDA Multinomial checks #4009
- π Fix a CUDA version typo in a warning #4175
- π Initialize CUDA before setting CUDA tensor types as default #4788
- β Add a missing `lazy_init` in the cuda python module #4907
- Fix lazy-init order in set device; it should not be called in `getDevCount` #4918
- π Make `torch.cuda.empty_cache()` a no-op when CUDA is not initialized #4936
CPU
- Assert MKL ld* conditions for ger, gemm, and gemv #4056

torch operators
- π Fix `tensor.repeat` when the underlying storage is not owned by `torch` (for example, coming from numpy) #4084
- β Add proper shape checking to `torch.cat` #4087
- Add a check for slice-shape match in `index_copy_` and `index_add_`. #4342
- π Fix use-after-free when advanced indexing tensors with tensors #4559
- π Fix `triu` and `tril` for zero-strided inputs on the GPU #4962
- π Fix the blas addmm (gemm) condition check #5048
- π Fix topk work-size computation #5053
- π Fix reduction functions to respect the stride of the output #4995
- π Improve float-precision stability of the `linspace` op, fixing #4419. #4470
autograd
- π Fix a Python GC race condition with `THPVariable_traverse` #4437

nn layers
- π Fix `padding_idx` being ignored in the backward pass for `Embedding(sparse=True)` #3842
- π Fix `cosine_similarity`'s output shape #3811
- β Add RNN argument checks #3925
- `NLLLoss` works for arbitrary dimensions #4654
- Stricter shape checks on Conv operators #4637
- π Fix maxpool3d / avgpool3d crashes #5052
- π Fix setting the use of running stats in InstanceNorm*d #4444
Multi-GPU
- π Fix DataParallel scattering for empty lists / dicts / tuples #3769
- π Fix refcycles in DataParallel scatter and gather (fixing elevated memory usage) #4988
- Broadcast output `requires_grad` only if the corresponding input `requires_grad` #5061

core
- β Remove hard file-offset reset in `load()` #3695
- Have `sizeof` account for the size of stored elements #3821
- π Fix undefined `FileNotFoundError` #4384
- Make `torch.set_num_threads` also set MKL threads (take 2) #5002

others
- π Fix wrong learning-rate evaluation in `CosineAnnealingLR` in Python 2 #4656
π Performance improvements
 slightly simplified math in IndexToOffset #4040
 π improve performance of maxpooling backwards #4106
 β Add cublas batched gemm support. #4151
 π Rearrange dimensions for pointwise operations for better performance. #4174
 π Improve memory access patterns for index operations. #4493
 π Improve CUDA softmax performance #4973
 π Fixed double memory accesses of several pointwise operations. #5068
π Documentation and UX Improvements
 π Better error messages for blas ops with cuda.LongTensor #4160
 β Add missing trtrs, orgqr, ormqr docs #3720
 π change doc for Adaptive Pooling #3746
 π Fix MultiLabelMarginLoss docs #3836
 π More docs for Conv1d Conv2d #3870
 π Improve Tensor.scatter_ doc #3937
 0οΈβ£ [docs] rnn.py: Note zero defaults for hidden state/cell #3951
 π Improve Tensor.new doc #3954
 π Improve docs for torch and torch.Tensor #3969
 β Added explicit tuple dimensions to doc for Conv1d. #4136
 π Improve svd doc #4155
 Correct instancenorm input size #4171
 π Fix StepLR example docs #4478

v0.3.0 Changes
December 05, 2017

Table of contents
 π₯ Breaking changes: removed reinforce()
 π New features
 Unreduced losses
 A profiler for the autograd engine
 More functions support Higher order gradients
 New features in Optimizers
 New layers and nn functionality
 New Tensor functions and Features
 Other additions
 API changes
 π Performance improvements
 Big reduction in framework overhead (helps small models)
 4x to 256x faster Softmax/LogSoftmax
 More...
 Framework Interoperability
 DLPack Interoperability
 Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
 π Bug Fixes (a lot of them)
π₯ Breaking changes
π Stochastic functions, i.e. `Variable.reinforce()`, were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid bookkeeping of sampled values. In practice, users were still bookkeeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

π¦ We introduce the torch.distributions package to replace stochastic functions.

Your previous code typically looked like this:

probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()

This is the new equivalent code:

probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
π New features
Unreduced losses
Now, some loss functions can compute per-sample losses in a minibatch.
- 0οΈβ£ By default, PyTorch sums losses over the minibatch and returns a single scalar loss. This was limiting to users.
- Now, a subset of loss functions allow specifying `reduce=False` to return individual losses for each sample in the minibatch
- Example: `loss = nn.CrossEntropyLoss(..., reduce=False)`
- π Currently supported losses: `MSELoss`, `NLLLoss`, `NLLLoss2d`, `KLDivLoss`, `CrossEntropyLoss`, `SmoothL1Loss`, `L1Loss`
- π More loss functions will be covered in the next release
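A minimal sketch of per-sample losses, using the 0.3.0-era `reduce=False` flag (later releases spell this `reduction='none'`):

```python
import torch
import torch.nn as nn

# reduce=False returns one loss value per element instead of a single scalar.
loss_fn = nn.L1Loss(reduce=False)
input = torch.zeros(4)
target = torch.tensor([1.0, 2.0, 3.0, 4.0])
per_sample = loss_fn(input, target)
print(per_sample)  # tensor([1., 2., 3., 4.])
```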
An in-built profiler in the autograd engine
We built a low-level profiler to help you identify bottlenecks in your models.

Let us start with an example:

>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
----------------------------------  ---------  ---------
Name                                 CPU time  CUDA time
----------------------------------  ---------  ---------
PowConstant                         142.036us    0.000us
N5torch8autograd9GraphRootE          63.524us    0.000us
PowConstantBackward                 184.228us    0.000us
MulConstant                          50.288us    0.000us
PowConstant                          28.439us    0.000us
Mul                                  20.154us    0.000us
N5torch8autograd14AccumulateGradE    13.790us    0.000us
N5torch8autograd5CloneE               4.088us    0.000us

The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special `nvprof` prefix. For example:

nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>

# in python
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

π¨ Then, you can load `trace_name.prof` in PyTorch and print a summary profile report.

>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)

π Read additional documentation here
Higher-order gradients
β Added higher-order gradient support for the following layers:
 ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
 π PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
 MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
 DataParallel
β‘οΈ Optimizers
 π optim.SparseAdam: implements a lazy version of the Adam algorithm, suitable for sparse tensors.
 In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
 Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
π New layers and nn functionality
- β Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
- β Added LPPool1d
- π F.pad now has support for:
  - 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
  - constant padding on n-d signals
- π¦ nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in `nearest` and `linear` modes.
- π `grid_sample` now allows padding with the border value via `padding_mode="border"`. `grid_sample` expects a grid in the range of `[-1, 1]`; if the values are out of these bounds, padding with the value `0.0` is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve the accuracy of the overall model.
- Introducing `nn.utils.parameters_to_vector` and `nn.utils.vector_to_parameters`. `parameters_to_vector` takes `net.parameters()` and returns a 1D vector that contains all the parameters; `vector_to_parameters` takes a vector of flattened parameters and copies the values over to a network's parameters. These are convenient for some reinforcement learning algorithms, such as the cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
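The round trip described above can be sketched as:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = nn.Linear(3, 2)  # 3*2 weights + 2 biases = 8 parameters

# Flatten every parameter into one 1-D vector...
vec = parameters_to_vector(net.parameters())
print(vec.shape)  # torch.Size([8])

# ...modify it, and write the result back into the network's parameters.
vector_to_parameters(torch.zeros_like(vec), net.parameters())
print(net.bias)  # now all zeros
```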
π Allow the user to not specify certain input dimensions for `AdaptivePool*d` and infer them at runtime. For example:

# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))

DataParallel container on CPU is now a no-op (instead of erroring out)
π New Tensor functions and features
- Introduced `torch.erf` and `torch.erfinv`, which compute the error function and the inverse error function of each element in the Tensor.
- β Adds broadcasting support to bitwise operators
- β Added `Tensor.put_` and `torch.take`, similar to `numpy.take` and `numpy.put`.
  - The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
  - The put function copies values into a tensor, also using linear indices.
  - Differences from the `numpy` equivalents:
    - `numpy.take` has an optional axis argument, which behaves like `index_select`. This `axis` argument is not yet present.
    - `numpy.put` repeats the values if necessary to make them as long as the indices. This behavior is not yet replicated.
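A sketch of the linear-indexing semantics described above:

```python
import torch

x = torch.tensor([[4, 3, 5],
                  [6, 7, 8]])

# take treats x as if it were flattened to 1-D: [4, 3, 5, 6, 7, 8].
# The output has the same shape as the index tensor.
idx = torch.tensor([0, 2, 5])
print(torch.take(x, idx))  # tensor([4, 5, 8])

# put_ writes values at linear indices, in place.
x.put_(torch.tensor([1]), torch.tensor([99]))
print(x[0, 1])  # tensor(99)
```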
- β Add `zeros` and `zeros_like` for sparse Tensors.
- 1-element Tensors can now be cast to Python scalars. For example: `int(torch.Tensor([5]))` works now.
Other additions
- Added `torch.cuda.get_device_name` and `torch.cuda.get_device_capability`, which do what the names say. Example:

>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)

- If one sets `torch.backends.cudnn.deterministic = True`, then the CuDNN convolutions use deterministic algorithms
- `torch.cuda.get_rng_state_all` and `torch.cuda.set_rng_state_all` are introduced to let you save / load the state of the random number generator over all GPUs at once
- `torch.cuda.empty_cache()` frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
API changes
- `softmax` and `log_softmax` now take a `dim` argument that specifies the dimension in which slices are taken for the softmax operation. `dim` allows negative dimensions as well (`dim = -1` will be the last dimension)
- `torch.potrf` (Cholesky decomposition) is now differentiable and defined on `Variable`
- β Remove all instances of `device_id` and replace it with `device`, to make things consistent
- `torch.autograd.grad` now allows you to specify inputs that are unused in the autograd graph if you use `allow_unused=True`. This is useful when using `torch.autograd.grad` in large graphs with lists of inputs / outputs. For example:

x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y])  # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True)  # works

- `pad_packed_sequence` now allows a `padding_value` argument that can be used instead of zero-padding
- `Dataset` now has a `+` operator (which uses `ConcatDataset`). You can do something like `MNIST(...) + FashionMNIST(...)`, and you will get a concatenated dataset containing samples from both.
- `torch.distributed.recv` allows Tensors to be received from any sender (hence, `src` is optional). `recv` returns the rank of the sender.
- β Adds `zero_()` to `Variable`
- `Variable.shape` returns the size of the Tensor (now made consistent with Tensor)
- `torch.version.cuda` specifies the CUDA version that PyTorch was compiled with
- β Add a missing function `random_` for CUDA.
- `torch.load` and `torch.save` can now take a `pathlib.Path` object, which is a standard Python3 typed filepath object
- If you want to load a model's `state_dict` into another model (for example to fine-tune a pretrained network), `load_state_dict` was strict on matching the key names of the parameters. Now we provide a `strict=False` option to `load_state_dict`, where it only loads in parameters where the keys match, and ignores the other parameter keys.
- β Added `nn.functional.embedding_bag`, which is equivalent to `nn.EmbeddingBag`
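The `strict=False` behavior of `load_state_dict` can be sketched like this (the two-layer model is illustrative):

```python
import torch
import torch.nn as nn

# A pretrained backbone, and a new model with an extra head.
pretrained = nn.Sequential(nn.Linear(4, 4))
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# strict=False loads the parameters whose keys match ("0.weight", "0.bias")
# and ignores the keys that exist on only one side ("1.weight", "1.bias").
model.load_state_dict(pretrained.state_dict(), strict=False)
print(torch.equal(model[0].weight, pretrained[0].weight))  # True
```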
π Performance Improvements
- π The overhead of `torch` functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
- π² softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the GPU kernels
- π 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
- nn.Embedding's renorm option is much faster on the GPU. For embedding dimensions of `100k x 128` and a batch size of 1024, it is 33x faster.
- All pointwise ops now use OpenMP and get multicore CPU benefits
- β Added dedicated CUDA kernels for group convolutions where `groups == nInputPlane` (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details as well as this table.
- π Fixed `optim.SGD`'s memory usage for sparse gradients (for ex. `nn.Embedding(..., sparse=True)`), reducing the usage on a user-provided test script by 10x.
- Optional NNPack integration for faster CPU convolutions (not part of binaries)
- β¬οΈ Reduce the overhead of broadcasting if Tensors aren't broadcastable
- `torch.nn.utils.weight_norm` over the right-most dimensions is faster
- Backward of `torch.norm` is sped up by ~1.5x
- Improve the performance of `pack_padded_sequence`
- β Add a single-argument version of `torch.arange`. For example: `torch.arange(10)`
Framework Interoperability

DLPack Interoperability
DLPack Tensors are cross-framework Tensor formats. We now have `torch.utils.to_dlpack(x)` and `torch.utils.from_dlpack(x)` to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.

Model exporter to ONNX
ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, and Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

π There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models.
π The operations supported in this release are:
 add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
 expand (only when used before a broadcasting ONNX operator; e.g., add)
 prelu (single weight shared among input channels not supported)
 threshold (nonzero threshold/nonzero value not supported)
 Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
 elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
 unfold (experimental support with ATenCaffe2 integration)
 Embedding (no optional arguments supported)
 RNN
 FeatureDropout (training mode not supported)
 Index (constant integer and tuple indices supported)
Usability Improvements
- More cogent error messages during indexing of Tensors / Variables
- β Add a proper error message for specifying a dimension on a tensor with no dimensions
- π Better error messages for Conv*d input shape checking
- More user-friendly error messages for LongTensor indexing
- π Better error messages and argument checking for Conv*d routines
- Trying to construct a Tensor from a Variable fails more appropriately
- β If you are using a PyTorch binary with an insufficient CUDA version, a warning is printed to the user.
- Fixed incoherent error messages in `load_state_dict`
- π Fix the error message for type mismatches with sparse tensors
π Bug fixes
torch
- π Fix CUDA lazy initialization to not trigger on calls to `torch.manual_seed` (instead, the calls are queued and run when CUDA is initialized)

Tensor
- If `x` is 2D, `x[[0, 3],]` was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do `x[[0, 3]]`
- π `x.sort(descending=True)` used to incorrectly fail for Tensors. Fixed a bug in the argument-checking logic to allow this.
- Tensor constructors with numpy input: `torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))`
  - torch will now copy the contents of the array into a storage of the appropriate type.
  - If the types match, it will share the underlying array (no copy), with equivalent semantics to initializing a tensor with another tensor.
  - On CUDA, `torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32))` will now work by making a copy.
- `ones_like` and `zeros_like` now create Tensors on the same device as the original Tensor
- π `torch.multinomial` on the CPU would reshape the input `prob_dist` in-place. Fixed this to make sure the `prob_dist` input's shape is unchanged after the call to `multinomial`
- `expand` and `expand_as` allow expanding an empty Tensor to another empty Tensor
- When `[..., None, ...]` was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
- π Fix the exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
- π torch.HalfTensor supports `numpy()` and `torch.from_numpy`
- β Add additional size checking for `torch.scatter`
- π Fix `torch.tril` and `torch.triu` on the GPU for storage-offset Tensors (they would return an incorrect result).
- π Fix a memory leak in CUDA qr decomposition
- π Fix stream-awareness issues in THCUNN kernels
- π Fix kwargs parsing in `torch.topk`
- π Fixed `random_` on the CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
- π Fix `ZeroDivisionError: float division by zero` when printing certain Tensors
- π `torch.gels` when `m > n` had a truncation bug on the CPU and returned incorrect results. Fixed.
- β Add a check in `tensor.numpy()` that no positional arguments are passed
- π Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is `contiguous`
- `any` and `all` work on empty Tensors on the CPU (previously errored out)
- π Fix `symeig` on CUDA for large matrices. The bug was that not enough space was being allocated for the workspace, causing undefined behavior.
- π Improved the numerical stability of `torch.var` and `torch.std` by using Welford's algorithm
- The random number generator returned `uniform` samples with inconsistent bounds (an inconsistency in the CPU implementation, and running into a cuBLAS bug). Now, all `uniform` sampled numbers will return within the bounds `[0, 1)`, across all types and devices
- π Fix `torch.svd` to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
- π Allow an empty index Tensor for `index_select` (instead of erroring out)
- Previously, when `eigenvector=False`, `symeig` returned some unknown values for the eigenvectors. Now we zero them out.
π sparse
- π Fix a bug with 'coalesced' calculation in sparse 'cadd'
- π Fixes `.type()` not converting the indices tensor.
- π Fixes sparse tensor coalesce on the GPU in corner cases

autograd
- π Fixed crashes when calling backward on a leaf variable with `requires_grad=False`
- π Fix a bug in Variable `type()` around non-default GPU input.
- When `torch.norm` returned `0.0`, the gradient was `NaN`. We now use the subgradient at `0.0`, so the gradient is `0.0`.
- π Fix a correctness issue with advanced indexing and higher-order gradients
- π `torch.prod`'s backward was failing on the GPU due to a type error; fixed.
- Advanced indexing on Variables now allows the index to be a LongTensor-backed Variable
- `Variable.cuda()` and `Tensor.cuda()` are consistent in kwargs options

optim
- β± `torch.optim.lr_scheduler` is now imported by default.
nn
- π Returning a dictionary from a nn.Module's forward function is now supported (it used to throw an error)
- When `register_buffer("foo", ...)` is called and `self.foo` already exists, then instead of silently failing, a `KeyError` is now raised
- Fixed loading of older checkpoints of RNN/LSTM which were missing `_data_ptrs` attributes.
- π `nn.Embedding` had a hard error when using the `max_norm` option. This is fixed now.
- π― When using the `max_norm` option, the passed-in indices were written to (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
- `F.affine_grid` now can take non-contiguous inputs
- EmbeddingBag can accept both 1D and 2D inputs now.
- βͺ Work around a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
- π Fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
- If BatchNorm has only `1` value per channel in total, raise an error in training mode.
- π Make cuDNN bindings respect the current cuda stream (previously raised an incoherent error)
- π Fix grid_sample backward when gradOutput is a zero-strided Tensor
- π Fix a segmentation fault when reflection padding is out of Tensor bounds.
- π If LogSoftmax had only 1 element, `-inf` was returned. Now this correctly returns `0.0`
- Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
- Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
- π Fixed ELU higher-order gradients when applied in-place
- βͺ Work around a CuDNN RNN bug for half-precision
- Prevent numerical issues with `poisson_nll_loss` when `log_input=False` by adding a small epsilon
distributed and multi-gpu
- π Allow kwargs-only inputs to DataParallel. This used to fail: `n = nn.DataParallel(Net()); out = n(input=i)`
- DistributedDataParallel calculates num_samples correctly in python2
- π Fix DistributedDataParallel when 1 GPU per process is used.
- π Fixed DataParallel to specify GPUs that don't include GPU 0
- DistributedDataParallel's exit doesn't error out anymore; the daemon flag is set.
- π Fix a bug in DistributedDataParallel when the model has no `buffers` (previously raised an incoherent error)
- Fix `__get_state__` to be functional in `DistributedDataParallel` (it was returning nothing)
- π Fix a deadlock in the NCCL bindings when the GIL and CudaFreeMutex were starving each other

Others
- `model.zoo.load_url` now first attempts to use the `requests` library if available, and then falls back to `urllib`
- Fix an error when `default_collate` is passed a collection of `numpy.str_`

v0.2.0 Changes
August 28, 2017

π Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
π Package documentation for this release is available at http://pytorch.org/docs/0.2.0/

We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.
Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.
Table of contents:
 π Tensor Broadcasting (numpy-style)
 Advanced Indexing for Tensors and Variables
 Higher-order gradients
 Distributed PyTorch (multi-node training, etc.)
 Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
 π New in torch and autograd: matmul, inverse, etc.
 π Easier debugging, better error messages
 π Bug Fixes
 βͺ Important Breakages and Workarounds
π Tensor Broadcasting (numpystyle)
π In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).
π PyTorch Broadcasting semantics closely follow numpystyle broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.
General Semantics
Two tensors are βbroadcastableβ if the following rules hold:
 Each tensor has at least one dimension.
 When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them must be 1, or one of them must not exist.
For Example:
```python
>>> x = torch.FloatTensor(5, 7, 3)
>>> y = torch.FloatTensor(5, 7, 3)
# same shapes are always broadcastable (i.e. the above rules always hold)

# can line up trailing dimensions
>>> x = torch.FloatTensor(5, 3, 4, 1)
>>> y = torch.FloatTensor(   3, 1, 1)
# x and y are broadcastable:
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

# but:
>>> x = torch.FloatTensor(5, 2, 4, 1)
>>> y = torch.FloatTensor(   3, 1, 1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
```
If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:
 If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
 Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.
For Example:
```python
# can line up trailing dimensions to make reading easier
>>> x = torch.FloatTensor(5, 1, 4, 1)
>>> y = torch.FloatTensor(   3, 1, 1)
>>> (x + y).size()
torch.Size([5, 3, 4, 1])

# error case
>>> x = torch.FloatTensor(5, 2, 4, 1)
>>> y = torch.FloatTensor(   3, 1, 1)
>>> (x + y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
```
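The two rule sets above can be sketched in a few lines of plain Python. This is an illustrative reimplementation for intuition only (`broadcastable` and `broadcast_shape` are hypothetical helper names, not PyTorch APIs):

```python
def broadcastable(shape_a, shape_b):
    """Return True if two shapes satisfy the broadcasting rules."""
    if not shape_a or not shape_b:  # each tensor needs at least one dimension
        return False
    # iterate from the trailing dimension backwards; missing dims are fine
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

def broadcast_shape(shape_a, shape_b):
    """Compute the resulting shape, assuming the shapes are broadcastable."""
    n = max(len(shape_a), len(shape_b))
    # prepend 1s to the shorter shape to make them equal length
    a = (1,) * (n - len(shape_a)) + tuple(shape_a)
    b = (1,) * (n - len(shape_b)) + tuple(shape_b)
    # resulting dimension size is the max of the two sizes
    return tuple(max(x, y) for x, y in zip(a, b))

print(broadcastable((5, 3, 4, 1), (3, 1, 1)))    # True
print(broadcastable((5, 2, 4, 1), (3, 1, 1)))    # False (2 != 3)
print(broadcast_shape((5, 1, 4, 1), (3, 1, 1)))  # (5, 3, 4, 1)
```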
More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.
Advanced Indexing for Tensors and Variables

PyTorch now supports a subset of NumPy-style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.

Let's look at some examples:
```python
x = torch.Tensor(5, 5, 5)

# Pure integer array indexing - specify arbitrary indices at each dimension
x[[1, 2], [3, 2], [1, 0]]
# --> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

# also supports broadcasting, duplicates
x[[2, 3, 2], [0], [1]]
# --> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

# arbitrary indexer shapes allowed
x[[[1, 0], [0, 1]], [0], [1]].shape
# --> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]], [x[0][0][1], x[1][0][1]]]

# can use colon, ellipsis
x[[0, 3], :, :]
x[[0, 3], ...]
# --> both yield a 2x5x5 Tensor [x[0], x[3]]

# can also use Tensors to index!
y = torch.LongTensor([0, 2, 4])
x[y, :, :]
# --> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

# selection with less than ndim, note the use of comma
x[[1, 3],]
# --> yields a 2x5x5 Tensor [x[1], x[3]]
```
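As a rough mental model, pure integer array indexing gathers one element per aligned position of the index arrays. A plain-Python sketch of the first example above (`gather` is a hypothetical helper for illustration; PyTorch does this natively in C):

```python
# build a 5x5x5 nested list where x[i][j][k] == k, so results are easy to read
x = [[[k for k in range(5)] for _ in range(5)] for _ in range(5)]

def gather(x, idx0, idx1, idx2):
    """Pick one element per aligned triple of indices, like x[idx0, idx1, idx2]."""
    return [x[i][j][k] for i, j, k in zip(idx0, idx1, idx2)]

# equivalent of x[[1, 2], [3, 2], [1, 0]] above
print(gather(x, [1, 2], [3, 2], [1, 0]))  # [1, 0] == [x[1][3][1], x[2][2][0]]
```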
Higher-order gradients

You can now evaluate higher-order differentials in PyTorch. For example, you can compute Hessian-vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.
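To build intuition for what "higher-order differentiation" means, here is a tiny self-contained sketch that computes a second derivative with nested dual numbers. This is forward-mode differentiation and purely illustrative; PyTorch implements reverse-mode autograd instead:

```python
class Dual:
    """Number carrying a value and a first derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule for the derivative part
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def second_derivative(f, x):
    # nesting duals differentiates the derivative itself
    inner = f(Dual(Dual(x, 1.0), Dual(1.0)))
    return inner.dot.dot

# f(x) = x**3  ->  f''(x) = 6x, so f''(2) = 12
print(second_derivative(lambda x: x * x * x, 2.0))  # 12.0
```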
In the 0.2 release, we've enabled the ability to compute higher-order gradients for all of the torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.

Here's a short example that penalizes the norm of the weight gradients of a ResNet-18 model, so that the volume of weights is slow-changing:
```python
import torch
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # any optimizer works

# dummy inputs for the example
input = Variable(torch.randn(2, 3, 224, 224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes.
# It instead returns the gradients as Variable tuples.

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()
```
We see two new concepts here:

1. torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients: you can operate on them, and then call backward() on the result.
2. Passing create_graph=True asks autograd to build a graph for the gradient computation itself, which is what makes it possible to differentiate through the gradients (as grad_norm.backward() does above).
The list of nn layers that support higher-order gradients is:

- AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d/2d, Linear, Bilinear
- pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
- ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
- L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d

The rest will be enabled in the next release.
To enable higher-order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.

Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.

Distributed PyTorch
We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv and all_reduce that will exchange Tensors among nodes (machines).

For the machines to first identify each other and assign unique numbers (ranks) to one another, we provide simple initialization methods:
 shared file system (requires that all processes can access a single file system)
 IP multicast (requires that all processes are in the same network)
 environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)
Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:

```python
import torch.distributed as dist

dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)

print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))
```
This would print `Hello from process 2 (out of 4)!` on the 3rd machine.

The world size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process a tensor should be sent.
Here's a snippet that shows how simple pointtopoint communication can be performed:
```python
# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    x.normal_()
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)
```
Asynchronous p2p functions (isend, irecv) are available too.

However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using send/recv. One example is all_reduce:

```python
x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)
```
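Conceptually, an all-reduce with summation leaves every rank holding the sum of all ranks' tensors. A toy single-process sketch of that semantics (a plain list stands in for "one value per process"; no real communication happens):

```python
def toy_all_reduce(values):
    """Every 'process' receives the sum of all contributions."""
    total = sum(values)
    return [total] * len(values)

# four processes, each contributing x = its own rank, as in the snippet above
print(toy_all_reduce([0, 1, 2, 3]))  # [6, 6, 6, 6]
```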
The distributed package is fairly low-level, so it allows you to implement more advanced algorithms and tailor the code to very specific purposes, but data-parallel training is such a common pattern that we have created high-level helpers for it.

Hence, we've introduced DistributedDataParallel, which is meant to be a nearly drop-in replacement for nn.DataParallel. Here's a code snippet demonstrating the changes necessary to add it to existing training code:

```python
# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())

# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    num_workers=args.workers,
    pin_memory=True,
    sampler=train_sampler)

for epoch in range(args.num_epochs):
    # Use the .set_epoch() method to reshuffle the dataset partition at every epoch
    train_sampler.set_epoch(epoch)
    # training loop
    ...
```
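The key idea behind a distributed sampler is that each rank iterates over a distinct slice of the dataset indices. A simplified sketch of such a partition (`partition_indices` is a hypothetical helper; the real DistributedSampler also shuffles per epoch and pads to equal lengths):

```python
def partition_indices(dataset_len, rank, world_size):
    """Give each rank an interleaved, non-overlapping subset of indices."""
    return list(range(rank, dataset_len, world_size))

print(partition_indices(10, rank=0, world_size=4))  # [0, 4, 8]
print(partition_indices(10, rank=1, world_size=4))  # [1, 5, 9]
```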
You can see a fuller ImageNet training example here.
New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.

New features
- forward_pre_hook is introduced to execute user-specified closures right before a forward function is called.
- Convenient access to non-leaf gradients: currently, to access and inspect gradients of intermediate values, we have to use hooks. This is not convenient for doing simple inspections, so we introduce retain_grad. It is best explained via an example:

```python
input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()

h1.retain_grad()
out.backward()

print(h1.grad)
# without calling retain_grad(), h1.grad is None
```
- DataParallel now supports dicts as inputs
New Layers

- Spatial Transformer Networks via F.grid_sample and F.affine_grid
- nn.SeLU and nn.AlphaDropout are introduced, from the paper: Self-Normalizing Neural Networks
- nn.GLU (Gated Linear Unit) is introduced, from the paper Convolutional Sequence to Sequence Learning
- Weight Normalization is now implemented via torch.utils.weight_norm
- You can now ignore specific target indices while computing cross_entropy_loss and nll_loss using the ignore_index argument. This is a cheap and useful way of implementing masking, where you can have a mask index that is ignored in computing the loss.
- F.normalize implements dimension-wise renormalization
- F.upsample and nn.Upsample consolidate multiple Upsampling layers into one function. They implement 2d and 3d bilinear/trilinear/nearest upsampling.
- nn.EmbeddingBag: when building bag-of-words models, doing an Embedding followed by Sum or Mean is common. For variable-length sequences, computing bags of embeddings involves masking. We provide a single nn.EmbeddingBag that is much more efficient and faster for computing bags of embeddings, especially for variable-length sequences.
- Numerically stable Binary Cross-Entropy loss via bce_with_logits
- A negative log-likelihood loss with Poisson distribution of the target via PoissonNLLLoss
- cosine_similarity: returns cosine similarity between x1 and x2, computed along dim.
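For intuition on the "numerically stable" point: BCE on raw logits can be rearranged so that no large positive number is ever exponentiated. A plain-Python sketch of that standard rearrangement (illustrative only, not PyTorch's actual implementation):

```python
import math

def bce_with_logits(x, z):
    """Stable binary cross-entropy on a raw logit x and target z in {0, 1}.

    Algebraically equal to -(z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))),
    but never exponentiates a large positive number.
    """
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def naive_bce(x, z):
    s = 1.0 / (1.0 + math.exp(-x))  # sigmoid saturates to exactly 1.0 for large x
    return -(z * math.log(s) + (1 - z) * math.log(1 - s))

print(bce_with_logits(0.0, 1.0))    # log(2) ~= 0.6931
print(bce_with_logits(100.0, 0.0))  # ~100.0; the naive version hits log(0) here
```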
Training utilities

Learning Rate Schedulers: torch.optim.lr_scheduler provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.

There are various strategies provided, which can be used depending on the appropriate situation; more can be read in the package docs:

- ReduceLROnPlateau, LambdaLR, StepLR, MultiStepLR, ExponentialLR
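As an example of the idea, a StepLR-style schedule decays the learning rate by a factor gamma every step_size epochs. A stateless sketch of the rule (illustrative; the real scheduler wraps an optimizer and updates its parameter groups in place):

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Decay the learning rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

print([round(step_lr(0.1, e, step_size=3), 4) for e in range(7)])
# [0.1, 0.1, 0.1, 0.01, 0.01, 0.01, 0.001]
```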
- ConcatDataset: a convenient dataset class that merges and concatenates two individual datasets.

New in torch and autograd
- All reduce functions such as sum and mean now default to squeezing the reduced dimension. For example, torch.sum(torch.randn(10, 20), 0) returns a 1D Tensor.
- x.shape, similar to numpy. A convenience property that is equivalent to x.size()
- torch.matmul, similar to np.matmul
- bitwise and, or, xor, lshift, rshift
- autograd support for inverse, gesv, cumprod, atan2
- unbiased var and std now available via a keyword argument option
- torch.scatter_add: like torch.scatter, except when duplicate indices are encountered, the values are summed.
- torch.median behaves similarly to torch.sum when no arguments are given, i.e. it reduces all the dimensions and returns a single median value of the flattened Tensor.
- masked_copy_ has been renamed to masked_scatter_ (with deprecation on masked_copy_)
- torch.manual_seed now seeds all CUDA devices as well
- You can now specify the random number generator object via keyword arguments: torch.rand(1000, generator=gen)
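The duplicate-index behavior of scatter_add is easiest to see in one dimension. A plain-Python sketch (`scatter_add_1d` is a hypothetical helper for illustration, not the torch API):

```python
def scatter_add_1d(dest, indices, src):
    """out[indices[i]] += src[i]; repeated indices accumulate by summation."""
    out = list(dest)
    for i, v in zip(indices, src):
        out[i] += v
    return out

# index 0 appears twice, so 1 and 3 are summed into slot 0
print(scatter_add_1d([0, 0, 0], [0, 2, 0], [1, 2, 3]))  # [4, 0, 2]
```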
Bug-fixes and small improvements

We now emit an error when a Variable is converted to a bool. For example:

```python
b = Variable(torch.zeros(1))
if b[0]:  # errors now
    ...
```
- Fix correctness bugs in QR decomposition on CUDA.
- Support for the IBM PowerPC64 platform.
- Check that the CuDNN version at compile-time is the same version at run-time.
- Improve error message in CUDA forked subprocess.
- Faster transposed-copy on CPU.
- Improve error messages in InstanceNorm.
- Add more argument checking for various routines, especially BatchNorm and Convolution routines.
- Better error messages around shape reporting across the CPU backend.
- Support more than 8 GPUs per machine (work around a CUDA p2p restriction).
- Improve error message when accessing attributes that don't exist.
- t() of Variable is now consistent with Tensor.
- Prevent divide-by-zero when dropout p=1.
- Fix sharing of CUDA tensors on non-current devices.
- When BN epsilon < allowed CuDNN value, fall back to THNN.
- Fix thread-thrashing when using different numbers of threads for MKL and OMP.
- Improve memory usage when using CuDNN RNN.
- Fix ZeroPad2d backwards with negative padding.
- Add a dummy tensor.data property, to provide an interpretable error message to users.
- Fix in-place division for Python 3.
- Raise an error when calling from_numpy on a 0-dim array.
- Empty Tensors no longer error out when shared across multiprocessing.
- Fix baddbmm for expanded tensors.
- Let parallel_apply accept arbitrary inputs.
- Keyword arguments in Tensor and Variable are now consistent.
- Fix torch.inverse when Magma is not available.
- Add a logical-not operator for ByteTensor.
- Add device asserts in scatter/gather kernels.
Important Breakages and Workarounds

As you've read, we've introduced two important changes that are not backward compatible:

- NumPy-style Broadcasting
- Reduction functions such as sum(1) now default to keepdim=False
We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.

tl;dr

Here is a code snippet that you can add to the top of your scripts. Adding this code will generate warnings highlighting incompatible code; fix your code to no longer generate warnings.

```python
# insert this at the top of your scripts (usually main.py)
import sys, warnings, traceback, torch

def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))

warnings.showwarning = warn_with_traceback
warnings.simplefilter('always', UserWarning)
torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True
```

Once all warnings disappear, you can remove the code snippet.
More elaborately
Now, let us see the three incompatible changes with examples.

Using the (now deprecated) 1-dimensional view pointwise function

Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal. The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The "1-dimensional" pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable but have the same number of elements.

For example:

```python
>>> torch.add(torch.ones(4), torch.ones(2, 2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements. Falling back to deprecated pointwise behavior.
 2
 2
 2
 2
[torch.FloatTensor of size 4]
```
Broadcasting in code where it didn't happen before
The introduction of broadcasting can cause backwards-incompatible changes in the case where two tensors do not have the same shape, but are broadcastable and have the same number of elements. For example:

```python
>>> torch.add(torch.ones(4, 1), torch.randn(4))
```

This would previously produce a Tensor with size torch.Size([4, 1]), but now produces a Tensor with size torch.Size([4, 4]).

In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set torch.utils.backcompat.broadcast_warning.enabled to True, which will generate a Python warning in such cases. For example:

```python
>>> torch.utils.backcompat.broadcast_warning.enabled = True
>>> torch.add(torch.ones(4, 1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are
broadcastable, and have the same number of elements.
```
β Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.
keepdim=False for Reduction Functions

To get a warning when using a dimensional reduction function with the default keepdim argument, set torch.utils.backcompat.keepdim_warning.enabled to True. For example:

```python
>>> torch.sum(torch.ones(2, 3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value
for keepdim which has changed default to False. Consider passing as kwarg.
 3
 3
[torch.FloatTensor of size 2]
```
As with torch.utils.backcompat.broadcast_warning.enabled, this warning can trigger from valid code, so you most likely want to disable it after migrating your code.

Note also that using keepdim=False can cause your existing code to "just work" with broadcasting. For example:

```python
# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4, 4).sum(dim=1, keepdim=True))
 5  5  5  5
 5  5  5  5
 5  5  5  5
 5  5  5  5
[torch.FloatTensor of size 4x4]

# new behavior with keepdim=False is equivalent to the non-broadcasted result
>>> torch.add(torch.ones(4), torch.ones(4, 4).sum(dim=1, keepdim=False))
 5
 5
 5
 5
[torch.FloatTensor of size 4]
```