xgboost v0.90 Release Notes

Release Date: 2019-05-20
  • 📦 XGBoost Python package drops Python 2.x (#4379, #4381)

    📦 Python 2.x is reaching its end-of-life at the end of this year. Many scientific Python packages are now moving to drop Python 2.x.

    XGBoost4J-Spark now requires Spark 2.4.x (#4377)

    • 👀 Spark 2.3 is reaching its end-of-life soon. See discussion at #4389.
    • Consistent handling of missing values (#4309, #4349, #4411): Many users had reported issues with inconsistent predictions between XGBoost4J-Spark and the Python XGBoost package. The issue was caused by Spark mis-handling non-zero missing values (NaN, -1, 999, etc.). We now alert the user whenever Spark doesn't handle missing values correctly (#4309, #4349). See the tutorial for dealing with missing values in XGBoost4J-Spark. This fix also depends on the availability of Spark 2.4.x.
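
    For reference, the Python package exposes the same idea through the missing argument of DMatrix. Below is a minimal, hypothetical sketch (the 999 sentinel is illustrative, not part of the fix itself):

    import numpy as np
    import xgboost as xgb

    # Toy data where the value 999 encodes "missing" (a hypothetical convention).
    X = np.array([[1.0, 999.0],
                  [2.0, 3.0],
                  [999.0, 4.0]])
    y = np.array([0, 1, 0])

    # Declare the sentinel explicitly instead of relying on the default NaN,
    # so the Python package and XGBoost4J-Spark treat the data the same way.
    dtrain = xgb.DMatrix(X, label=y, missing=999.0)
    bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)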

    🐎 Roadmap: better performance scaling for multi-core CPUs (#4310)

    • 🐎 Poor performance scaling of the hist algorithm for multi-core CPUs has been under investigation (#3810). #4310 optimizes quantile sketches and other pre-processing tasks. Special thanks to @SmirnovEgorRu.

    Roadmap: Harden distributed training (#4250)

    • Make distributed training in XGBoost more robust by hardening Rabit, which implements the AllReduce primitive. In particular, improve test coverage on mechanisms for fault tolerance and recovery. Special thanks to @chenqin.

    🆕 New feature: Multi-class metric functions for GPUs (#4368)

    • Metrics for multi-class classification have been ported to GPU: merror, mlogloss. Special thanks to @trivialfis.
    • 👍 With supported metrics, XGBoost will select the correct devices based on your system and the n_gpus parameter.
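
    A minimal sketch of using one of these GPU-evaluated metrics (assumes a CUDA-capable machine and the v0.90-era gpu_hist tree method; the data here is random and purely illustrative):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(200, 5)
    y = np.random.randint(0, 3, size=200)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        'objective': 'multi:softprob',
        'num_class': 3,
        'tree_method': 'gpu_hist',   # train on the GPU
        'eval_metric': 'mlogloss',   # multi-class metric, now evaluated on the GPU
    }
    bst = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, 'train')])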

    🆕 New feature: Scikit-learn-like random forest API (#4148, #4255, #4258)

    • 🚀 XGBoost Python package now offers XGBRFClassifier and XGBRFRegressor API to train random forests. See the tutorial. Special thanks to @canonizer
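
    A minimal sketch of the new API (the dataset below comes from scikit-learn and is only for illustration):

    from sklearn.datasets import load_breast_cancer
    from xgboost import XGBRFClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # XGBRFClassifier fits a random forest (many parallel trees, a single
    # boosting round) behind the familiar scikit-learn fit/predict interface.
    clf = XGBRFClassifier(n_estimators=100, max_depth=4, random_state=0)
    clf.fit(X, y)
    print(clf.predict(X[:5]))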

    🆕 New feature: use external memory in GPU predictor (#4284, #4396, #4438, #4457)

    It is now possible to make predictions on GPU when the input is read from external memory. This is useful when you want to make predictions on a big dataset that does not fit into GPU memory. Special thanks to @rongou, @canonizer, @sriramch.

    dtest = xgboost.DMatrix('test_data.libsvm#dtest.cache')
    bst.set_param('predictor', 'gpu_predictor')
    bst.predict(dtest)
    

    Coming soon: GPU training (gpu_hist) with external memory

    🆕 New feature: XGBoost can now handle comments in LIBSVM files (#4430)

    🆕 New feature: Embed XGBoost in your C/C++ applications using CMake (#4323, #4333, #4453)

    It is now easier than ever to embed XGBoost in your C/C++ applications. In your CMakeLists.txt, add xgboost::xgboost as a linked library:

    find_package(xgboost REQUIRED)
    add_executable(api-demo c-api-demo.c)
    target_link_libraries(api-demo xgboost::xgboost)
    

    📚 XGBoost C API documentation is available. Special thanks to @trivialfis

    🐎 Performance improvements

    • 👉 Use feature interaction constraints to narrow split search space (#4341, #4428)
    • ➕ Additional optimizations for gpu_hist (#4248, #4283)
    • ⬇️ Reduce OpenMP thread launches in gpu_hist (#4343)
    • ➕ Additional optimizations for multi-node multi-GPU random forests. (#4238)
    • Allocate unique prediction buffer for each input matrix, to avoid re-sizing GPU array (#4275)
    • ✂ Remove various synchronisations from CUDA API calls (#4205)
    • XGBoost4J-Spark
      • Allow the user to control whether to cache partitioned training data, to potentially reduce execution time (#4268)

    🐛 Bug-fixes

    • 🛠 Fix node reuse in hist (#4404)
    • 🛠 Fix GPU histogram allocation (#4347)
    • 🛠 Fix matrix attributes not sliced (#4311)
    • Revise AUC and AUCPR metrics to work with weighted ranking tasks (#4216, #4436)
    • 🛠 Fix timer invocation for InitDataOnce() in gpu_hist (#4206)
    • 🛠 Fix R-devel errors (#4251)
    • ⚡️ Make gradient update in GPU linear updater thread-safe (#4259)
    • Prevent out-of-range access in column matrix (#4231)
    • 👻 Don't store DMatrix handle in Python object until it's initialized, to improve exception safety (#4317)
    • XGBoost4J-Spark
      • Fix non-deterministic order within a zipped partition on prediction (#4388)
      • Remove race condition on tracker shutdown (#4224)
      • Allow setting the parameter maxLeaves (#4226)
      • Allow partial evaluation of dataframe before prediction (#4407)
      • Automatically set maximize_evaluation_metrics if not explicitly given (#4446)

    API changes

    • 🗄 Deprecate reg:linear in favor of reg:squarederror (#4267, #4427); see the sketch after this list.
    • ➕ Add attribute getter and setter to the Booster object in XGBoost4J (#4336)
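
    A minimal sketch of switching to the new objective name (the two spellings train the same squared-error objective; reg:linear now only emits a deprecation warning; the data here is random and purely illustrative):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(200, 4)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * np.random.randn(200)
    dtrain = xgb.DMatrix(X, label=y)

    # Prefer 'reg:squarederror'; 'reg:linear' is deprecated as of v0.90.
    params = {'objective': 'reg:squarederror', 'eta': 0.1}
    bst = xgb.train(params, dtrain, num_boost_round=20)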

    ♻️ Maintenance: Refactor C++ code for legibility and maintainability

    • 🛠 Fix clang-tidy warnings. (#4149)
    • ✂ Remove deprecated C APIs. (#4266)
    • 👉 Use Monitor class to time functions in hist. (#4273)
    • 💅 Retire DVec class in favour of C++20-style span for device memory. (#4293)
    • 👌 Improve HostDeviceVector exception safety (#4301)

    🚧 Maintenance: testing, continuous integration, build system

    • ♻️ Major refactor of CMakeLists.txt (#4323, #4333, #4453): adopt modern CMake and export XGBoost as a target
    • 👷 Major improvement in Jenkins CI pipeline (#4234)
      • Migrate all Linux tests to Jenkins (#4401)
      • Builds and tests are now de-coupled, to test an artifact against multiple versions of CUDA, JDK, and other dependencies (#4401)
      • Add Windows GPU to Jenkins CI pipeline (#4463, #4469)
    • 👌 Support CUDA 10.1 (#4223, #4232, #4265, #4468)
    • Python wheels are now built with CUDA 9.0, so that JIT is not required on Volta architecture (#4459)
    • ↔ Integrate with NVTX CUDA profiler (#4205)
    • ➕ Add a test for cpu predictor using external memory (#4308)
    • ♻️ Refactor tests to get rid of duplication (#4358)
    • ✂ Remove test dependency on craigcitro/r-travis, since it's deprecated (#4353)
    • ➕ Add files from local R build to .gitignore (#4346)
    • 👉 Make XGBoost4J compatible with Java 9+ by revising NativeLibLoader (#4351)
    • 🏗 Jenkins build for CUDA 10.0 (#4281)
    • ✂ Remove remaining silent and debug_verbose in Python tests (#4299)
    • 🐧 Use all cores to build XGBoost4J lib on Linux (#4304)
    • ⬆️ Upgrade Jenkins Linux build environment to GCC 5.3.1, CMake 3.6.0 (#4306)
    • 👉 Make CMakeLists.txt compatible with CMake 3.3 (#4420)
    • ➕ Add OpenMP option in CMakeLists.txt (#4339)
    • ⚠ Get rid of a few trivial compiler warnings (#4312)
    • ➕ Add external Docker build cache, to speed up builds on Jenkins CI (#4331, #4334, #4458)
    • 🛠 Fix Windows tests (#4403)
    • 🛠 Fix a broken python test (#4395)
    • 👀 Use a fixed seed to split data in XGBoost4J-Spark tests, for reproducibility (#4417)
    • ➕ Add additional Python tests to test training under constraints (#4426)
    • 🏗 Enable building with shared NCCL. (#4447)

    📚 Usability Improvements, Documentation

    • Document limitation of one-split-at-a-time Greedy tree learning heuristic (#4233)
    • ⚡️ Update build doc: PyPI wheels now support multi-GPU (#4219)
    • Fix docs for num_parallel_tree (#4221)
    • Fix document about colsample_by* parameter (#4340)
    • ✅ Make the train and test inputs use the same column names (#4329)
    • ⚡️ Update R contribute link. (#4236)
    • 🛠 Fix travis R tests (#4277)
    • 🌲 Log version number in crash log in XGBoost4J-Spark (#4271, #4303)
    • 👍 Allow suppression of Rabit output in Booster::train in XGBoost4J (#4262)
    • ➕ Add tutorial on handling missing values in XGBoost4J-Spark (#4425)
    • 🛠 Fix typos (#4345, #4393, #4432, #4435)
    • ➕ Added language classifier in setup.py (#4327)
    • ➕ Added Travis CI badge (#4344)
    • ➕ Add BentoML to use case section (#4400)
    • ✂ Remove subtly sexist remark (#4418)
    • ➕ Add R vignette about parsing JSON dumps (#4439)

    Acknowledgement

    Contributors : Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Daniel Hen (@Daniel8hen), Jiaxiang Li (@JiaxiangBU), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), Andy Adinets (@canonizer), Jonas (@elcombato), Harry Braviner (@harrybraviner), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), James Lamb (@jameslamb), Jean-Francois Zinque (@jeffzi), Yang Yang (@jokerkeny), Mayank Suman (@mayanksuman), jess (@monkeywithacupcake), Hajime Morrita (@omo), Ravi Kalia (@project-delphi), @ras44, Rong Ou (@rongou), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), @sriramch, Jiaming Yuan (@trivialfis), Christopher Suchanek (@wsuchy), Bozhao (@yubozhao)

    Reviewers : Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Daniel Hen (@Daniel8hen), Jiaxiang Li (@JiaxiangBU), Laurae (@Laurae2), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), @alois-bissuel, Andy Adinets (@canonizer), Chen Qin (@chenqin), Harry Braviner (@harrybraviner), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), @jakirkham, James Lamb (@jameslamb), Julien Schueller (@jschueller), Mayank Suman (@mayanksuman), Hajime Morrita (@omo), Rong Ou (@rongou), Sara Robinson (@sararob), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), @sriramch, Sean Owen (@srowen), Sergei Lebedev (@superbobry), Yuan (Terry) Tang (@terrytangyuan), Theodore Vasiloudis (@thvasilo), Matthew Tovbin (@tovbinm), Jiaming Yuan (@trivialfis), Xin Yin (@xydrolase)


Previous changes from v0.82

  • 🚀 This release is packed with many new features and bug fixes.

    🐎 Roadmap: better performance scaling for multi-core CPUs (#3957)

    • 🐎 Poor performance scaling of the hist algorithm for multi-core CPUs has been under investigation (#3810). #3957 marks an important step toward better performance scaling, by using software pre-fetching and replacing STL vectors with C-style arrays. Special thanks to @Laurae2 and @SmirnovEgorRu.
    • 👀 See #3810 for latest progress on this roadmap.

    🆕 New feature: Distributed Fast Histogram Algorithm (hist) (#4011, #4102, #4140, #4128)

    • It is now possible to run the hist algorithm in distributed setting. Special thanks to @CodingCat. The benefits include:
      1. Faster local computation via feature binning
      2. Support for monotonic constraints and feature interaction constraints
      3. Simpler codebase than approx, allowing for future improvement
    • 🔀 Depth-wise tree growing is now performed in a separate code path, so that cross-node synchronization is performed only once per level.

    🆕 New feature: Multi-Node, Multi-GPU training (#4095)

    • Distributed training is now able to utilize clusters equipped with NVIDIA GPUs. In particular, the rabit AllReduce layer will communicate GPU device information. Special thanks to @mt-jones, @RAMitchell, @rongou, @trivialfis, @canonizer, and @jeffdk.
    • Resource management systems will be able to assign a rank for each GPU in the cluster.
    • 👷 In Dask, users will be able to construct a collection of XGBoost processes over an inhomogeneous device cluster (i.e. workers with different numbers and/or kinds of GPUs).

    🆕 New feature: Multiple validation datasets in XGBoost4J-Spark (#3904, #3910)

    • 🐎 You can now track the performance of the model during training with multiple evaluation datasets. By specifying eval_sets or calling setEvalSets on an XGBoostClassifier or XGBoostRegressor, you can pass in multiple evaluation datasets typed as a Map from String to DataFrame. Special thanks to @CodingCat.
    • 👀 See the usage of multiple validation datasets here

    🆕 New feature: Additional metric functions for GPUs (#3952)

    • Element-wise metrics have been ported to GPU: rmse, mae, logloss, poisson-nloglik, gamma-deviance, gamma-nloglik, error, tweedie-nloglik. Special thanks to @trivialfis and @RAMitchell.
    • 👍 With supported metrics, XGBoost will select the correct devices based on your system and the n_gpus parameter.

    🆕 New feature: Column sampling at individual nodes (splits) (#3971)

    • 0️⃣ Columns (features) can now be sampled at individual tree nodes, in addition to per-tree and per-level sampling. To enable per-node sampling, set colsample_bynode parameter, which represents the fraction of columns sampled at each node. This parameter is set to 1.0 by default (i.e. no sampling per node). Special thanks to @canonizer.
    • The colsample_bynode parameter works cumulatively with other colsample_by* parameters: for example, {'colsample_bynode':0.5, 'colsample_bytree':0.5} with 100 columns will give 25 features to choose from at each split.
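
    A minimal sketch of combining the column-sampling parameters described above (with 100 columns, 0.5 × 0.5 leaves roughly 25 candidate features per split; the data here is random and purely illustrative):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(500, 100)
    y = np.random.rand(500)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        'colsample_bytree': 0.5,   # sample 50 of the 100 columns per tree
        'colsample_bynode': 0.5,   # sample half of those again at each split (~25)
    }
    bst = xgb.train(params, dtrain, num_boost_round=10)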

    🌲 Major API change: consistent logging level via verbosity (#3982, #4002, #4138)

    • ⚠ XGBoost now allows fine-grained control over logging. You can set verbosity to 0 (silent), 1 (warning), 2 (info), or 3 (debug). This is useful for controlling the amount of logging output. Special thanks to @trivialfis. See the sketch after this list.
    • 🗄 Parameters silent and debug_verbose are now deprecated.
    • 🔧 Note: Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as a warning message. If there is unexpected behaviour, please try increasing the value of verbosity.
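
    A minimal sketch of the new logging control (replacing the deprecated silent and debug_verbose parameters; the data here is random and purely illustrative):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 5)
    y = np.random.rand(100)
    dtrain = xgb.DMatrix(X, label=y)

    # verbosity: 0 = silent, 1 = warning, 2 = info, 3 = debug
    params = {'verbosity': 2}
    bst = xgb.train(params, dtrain, num_boost_round=5)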

    Major bug fix: external memory (#4040, #4193)

    • Clarify object ownership in multi-threaded prefetcher, to avoid memory error.
    • Correctly merge two column batches (which uses CSC layout).
    • ➕ Add unit tests for external memory.
    • Special thanks to @trivialfis and @hcho3.

    🛠 Major bug fix: early stopping fixed in XGBoost4J and XGBoost4J-Spark (#3928, #4176)

    • 📦 Early stopping in XGBoost4J and XGBoost4J-Spark is now consistent with its counterpart in the Python package. Training stops if the current iteration is earlyStoppingSteps away from the best iteration. If there are multiple evaluation sets, only the last one is used to determine early stopping.
    • See the updated documentation here
    • Special thanks to @CodingCat, @yanboliang, and @mingyang.

    Major bug fix: infrequent features should not crash distributed training (#4045)

    • 🛠 For infrequently occurring features, some partitions may not get any instance. This scenario used to crash distributed training due to malformed ranges. The problem has now been fixed.
    • In practice, one-hot-encoded categorical variables tend to produce rare features, particularly when the cardinality is high.
    • Special thanks to @CodingCat.

    🐎 Performance improvements

    • Faster, more space-efficient radix sorting in gpu_hist (#3895)
    • Subtraction trick in histogram calculation in gpu_hist (#3945)
    • More performant re-partition in XGBoost4J-Spark (#4049)

    🐛 Bug-fixes

    • 🛠 Fix semantics of gpu_id when running multiple XGBoost processes on a multi-GPU machine (#3851)
    • 🛠 Fix page storage path for external memory on Windows (#3869)
    • 🛠 Fix configuration setup so that DART utilizes GPU (#4024)
    • Eliminate NAN values from SHAP prediction (#3943)
    • Prevent empty quantile sketches in hist (#4155)
    • Enable running objectives with 0 GPU (#3878)
    • Parameters are no longer dependent on system locale (#3891, #3907)
    • 👉 Use consistent data type in the GPU coordinate descent code (#3917)
    • ✂ Remove undefined behavior in the CLI config parser on the ARM platform (#3976)
    • 🎉 Initialize counters in GPU AllReduce (#3987)
    • Prevent deadlocks in GPU AllReduce (#4113)
    • Load correct values from sliced NumPy arrays (#4147, #4165)
    • 🛠 Fix incorrect GPU device selection (#4161)
    • 👉 Make feature binning logic in hist aware of query groups when running a ranking task (#4115). For the ranking task, weights are assigned per query group rather than per instance.
    • 🌲 Generate correct C++ exception type for LOG(FATAL) macro (#4159)
    • 📦 Python package
      • Python package should run on system without PATH environment variable (#3845)
      • Fix coef_ and intercept_ signature to be compatible with sklearn.RFECV (#3873)
      • Use UTF-8 encoding in Python package README, to support non-English locale (#3867)
      • Add AUC-PR to list of metrics to maximize for early stopping (#3936)
      • Allow loading pickles without self.booster attribute, for backward compatibility (#3938, #3944)
      • White-list DART for feature importances (#4073)
      • Update usage of h2oai/datatable (#4123)
    • XGBoost4J-Spark
      • Address scalability issue in prediction (#4033)
      • Enforce the use of per-group weights for ranking task (#4118)
      • Fix vector size of rawPredictionCol in XGBoostClassificationModel (#3932)
      • More robust error handling in Spark tracker (#4046, #4108)
      • Fix return type of setEvalSets (#4105)
      • Return correct value of getMaxLeaves (#4114)

    API changes

    • Add experimental parameter single_precision_histogram to use single-precision histograms for the gpu_hist algorithm (#3965); see the sketch after this list.
    • 📦 Python package
      • Add option to select the type of feature importances in the scikit-learn interface (#3876)
      • Add trees_to_dataframe() method to dump decision trees as a Pandas data frame (#4153)
      • Add options to control node shapes in the GraphViz plotting function (#3859)
      • Add xgb_model option to XGBClassifier, to load previously saved model (#4092)
      • Passing lists into DMatrix is now deprecated (#3970)
    • XGBoost4J
      • Support multiple feature importance features (#3801)
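
    A minimal sketch of the experimental single-precision histogram option mentioned above (assumes a CUDA-capable machine; the data here is random and purely illustrative):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 20)
    y = np.random.rand(1000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        'tree_method': 'gpu_hist',
        'single_precision_histogram': True,  # experimental: build FP32 histograms
    }
    bst = xgb.train(params, dtrain, num_boost_round=10)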

    ♻️ Maintenance: Refactor C++ code for legibility and maintainability

    • ♻️ Refactor hist algorithm code and add unit tests (#3836)
    • ♻️ Minor refactoring of split evaluator in gpu_hist (#3889)
    • ✂ Removed unused leaf vector field in the tree model (#3989)
    • Simplify the tree representation by combining TreeModel and RegTree classes (#3995)
    • Simplify and harden tree expansion code (#4008, #4015)
    • De-duplicate parameter classes in the linear model algorithms (#4013)
    • Robust handling of ranges with C++20 span in gpu_exact and gpu_coord_descent (#4020, #4029)
    • Simplify tree training code (#3825). Also use Span class for robust handling of ranges.

    🚧 Maintenance: testing, continuous integration, build system

    • 👍 Disallow std::regex since it's not supported by GCC 4.8.x (#3870)
    • ➕ Add multi-GPU tests for coordinate descent algorithm for linear models (#3893, #3974)
    • 💅 Enforce naming style in Python lint (#3896)
    • ♻️ Refactor Python tests (#3897, #3901): Use pytest exclusively, display full trace upon failure
    • ➕ Address DeprecationWarning when using Python collections (#3909)
    • 🔌 Use correct group for maven site plugin (#3937)
    • 👷 Jenkins CI is now using on-demand EC2 instances exclusively, due to unreliability of Spot instances (#3948)
    • 👍 Better GPU performance logging (#3945)
    • 🛠 Fix GPU tests on machines with only 1 GPU (#4053)
    • ⚠ Eliminate CRAN check warnings and notes (#3988)
    • ➕ Add unit tests for tree serialization (#3989)
    • ➕ Add unit tests for tree fitting functions in hist (#4155)
    • ➕ Add a unit test for gpu_exact algorithm (#4020)
    • Correct JVM CMake GPU flag (#4071)
    • 🛠 Fix failing Travis CI on Mac (#4086)
    • Speed up Jenkins by not compiling CMake (#4099)
    • 👷 Analyze C++ and CUDA code using clang-tidy, as part of Jenkins CI pipeline (#4034)
    • 🛠 Fix broken R test: Install Homebrew GCC (#4142)
    • ✅ Check for empty datasets in GPU unit tests (#4151)
    • 🛠 Fix Windows compilation (#4139)
    • 👕 Comply with latest convention of cpplint (#4157)
    • 🛠 Fix a unit test in gpu_hist (#4158)
    • ✅ Speed up data generation in Python tests (#4164)

    Usability Improvements

    • ➕ Add link to InfoWorld 2019 Technology of the Year Award (#4116)
    • ✂ Remove outdated AWS YARN tutorial (#3885)
    • Document current limitation in number of features (#3886)
    • ✂ Remove unnecessary warning when gblinear is selected (#3888)
    • 📜 Document limitation of CSV parser: header not supported (#3934)
    • 🌲 Log training parameters in XGBoost4J-Spark (#4091)
    • Clarify early stopping behavior in the scikit-learn interface (#3967)
    • Clarify behavior of max_depth parameter (#4078)
    • 📄 Revise Python docstrings for ranking task (#4121). In particular, weights must be per-group in learning-to-rank setting.
    • Document parameter num_parallel_tree (#4022)
    • ➕ Add Jenkins status badge (#4090)
    • Warn users against using internal functions of Booster object (#4066)
    • 💅 Reformat benchmark_tree.py to comply with Python style convention (#4126)
    • Clarify a comment in objectiveTrait (#4174)
    • 🛠 Fix typos and broken links in documentation (#3890, #3872, #3902, #3919, #3975, #4027, #4156, #4167)

    Acknowledgement

    Contributors (in no particular order): Jiaming Yuan (@trivialfis), Hyunsu Cho (@hcho3), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Yanbo Liang (@yanboliang), Andy Adinets (@canonizer), Tong He (@hetong007), Yuan Tang (@terrytangyuan)

    First-time Contributors (in no particular order): Jelle Zijlstra (@JelleZijlstra), Jiacheng Xu (@jiachengxu), @ajing, Kashif Rasul (@kashif), @theycallhimavi, Joey Gao (@pjgao), Prabakaran Kumaresshan (@nixphix), Huafeng Wang (@huafengw), @lyxthe, Sam Wilkinson (@scwilkinson), Tatsuhito Kato (@stabacov), Shayak Banerjee (@shayakbanerjee), Kodi Arfer (@Kodiologist), @KyleLi1985, Egor Smirnov (@SmirnovEgorRu), @tmitanitky, Pasha Stetsenko (@st-pasha), Kenichi Nagahara (@keni-chi), Abhai Kollara Dilip (@abhaikollara), Patrick Ford (@pford221), @hshujuan, Matthew Jones (@mt-jones), Thejaswi Rao (@teju85), Adam November (@anovember)

    First-time Reviewers (in no particular order): Mingyang Hu (@mingyang), Theodore Vasiloudis (@thvasilo), Jakub Troszok (@troszok), Rong Ou (@rongou), @Denisevi4, Matthew Jones (@mt-jones), Jeff Kaplan (@jeffdk)