xgboost v0.81 Release Notes

Release Date: 2018-11-04
    New feature: feature interaction constraints

    • Users can now control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
    • A tutorial is available, along with R and Python examples.

    New feature: learning to rank using the scikit-learn interface

    • The learning-to-rank task is now available in the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into a scikit-learn pipeline.
    • An example of using the XGBRanker class can be found at demo/rank/rank_sklearn.py.

    New feature: R interface for SHAP interactions

    • SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).

    New feature: GPU predictor now uses multiple GPUs

    • The GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738).

    New feature: Scale distributed XGBoost to large-scale clusters

    • Fix an OS file descriptor limit assertion error on large clusters (#3835, dmlc/rabit#73) by replacing the select()-based AllReduce/Broadcast with a poll()-based implementation.
    • Mitigate the tracker "thundering herd" issue on large clusters by adding exponential backoff retry when workers connect to the tracker.
    • With this change, and some additional tuning, we were able to scale to 1.5k executors on a 12-billion-row dataset.

    New feature: Additional objective functions for GPUs

    • New objective functions ported to GPU: hinge, multi:softmax, multi:softprob, count:poisson, reg:gamma, reg:tweedie.
    • With supported objectives, XGBoost will select the correct devices based on your system and the n_gpus parameter.

    Major bug fix: learning to rank with XGBoost4J-Spark

    • Previously, repartitionForData would shuffle data and lose the ordering necessary for the ranking task.
    • To fix this issue, data points within each RDD partition are now explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).

    Major bug fix: early stopping fixed in XGBoost4J-Spark

    • โšก๏ธ Earlier implementation of early stopping had incorrect semantics and didn't let users to specify direction for optimizing (maximize / minimize)
    • A parameter maximize_evaluation_metrics is defined so as to tell whether a metric should be maximized or minimized as part of early stopping criteria (#3808). Also early stopping now has correct semantics.

    API changes

    • Column sampling by level (colsample_bylevel) is now functional for hist algorithm (#3635, #3862)
    • The gpu: prefix for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
    • Add disable_default_eval_metric parameter to disable the default metric (#3606)
    • Experimental AVX support for gradient computation is removed (#3752)
    • XGBoost4J-Spark
      • Add rank:ndcg and rank:map to supported objectives (#3697)
    • Python package
      • Add callbacks argument to fit() function of scikit-learn API (#3682)
      • Add XGBRanker to scikit-learn interface (#3560, #3848)
      • Add validate_features argument to predict() function of scikit-learn API (#3653)
      • Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
      • Add coef_ and intercept_ as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.

    ๐ŸŽ Performance improvements

    • Address very high GPU memory usage for large data (#3635)
    • Fix performance regression within EvaluateSplits() of gpu_hist algorithm (#3680)

    ๐Ÿ› Bug-fixes

    • Fix a problem in GPU quantile sketch with tiny instance weights (#3628)
    • Fix copy constructor for HostDeviceVectorImpl to prevent dangling pointers (#3657)
    • Fix a bug in partitioned file loading (#3673)
    • Fix an uninitialized pointer in gpu_hist (#3703)
    • Re-share data among GPUs when the number of GPUs changes (#3721)
    • Add back max_delta_step to split evaluation (#3668)
    • Do not round up integer thresholds for integer features in JSON dump (#3717)
    • Use dmlc::TemporaryDirectory to handle temporaries in a cross-platform way (#3783)
    • Fix accuracy problem with gpu_hist when min_child_weight and lambda are set to 0 (#3793)
    • Make sure the tree_method parameter is recognized and not silently ignored (#3849)
    • XGBoost4J-Spark
      • Make sure thresholds are considered when executing predict() method (#3577)
      • Avoid losing precision when computing probabilities by converting to Double early (#3576)
      • getTreeLimit() should return Int (#3602)
      • Fix checkpoint serialization on HDFS (#3614)
      • Throw ControlThrowable instead of InterruptedException so that it is properly re-thrown (#3632)
      • Remove extraneous output to stdout (#3665)
      • Allow specification of task type for custom objectives and evaluations (#3646)
      • Fix distributed updater check (#3739)
      • Fix issue when spark job execution thread cannot return before we execute first() (#3758)
    • Python package
      • Fix accessing DMatrix.handle before it is set (#3599)
      • XGBClassifier.predict() should return margin scores when output_margin is set to true (#3651)
      • Early stopping callback should maximize metric of form NDCG@n- (#3685)
      • Preserve feature names when slicing DMatrix (#3766)
    • R package
      • Replace nround with nrounds to match actual parameter (#3592)
      • Amend xgb.createFolds to handle classes of a single element (#3630)
      • Fix buggy random generator and make colsample_bytree functional (#3781)

    Maintenance: testing, continuous integration, build system

    • Add sanitizer tests to Travis CI (#3557)
    • Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
    • Comply with CRAN submission policy (#3660, #3728)
    • Remove copy-paste error in JVM test suite (#3692)
    • Disable flaky tests in R-package/tests/testthat/test_update.R (#3723)
    • Make Python tests compatible with scikit-learn 0.20 release (#3731)
    • Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
    • Add multi-GPU unit test environment (#3741)
    • Allow plug-ins to be built by CMake (#3752)
    • Test wheel compatibility on CPU containers for pull requests (#3762)
    • Fix broken doc build due to Matplotlib 3.0 release (#3764)
    • Produce xgboost.so for XGBoost-R on Mac OSX, so that make install works (#3767)
    • Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3775, #3776, #3777)
    • Add basic unit tests for gpu_hist algorithm (#3785)
    • Fix Python environment for distributed unit tests (#3806)
    • Test wheels on CUDA 10.0 container for compatibility (#3838)
    • Fix JVM doc build (#3853)

    Maintenance: Refactor C++ code for legibility and maintainability

    • Merge generic device helper functions into GPUSet class (#3626)
    • Re-factor column sampling logic into ColumnSampler class (#3635, #3637)
    • Replace std::vector with HostDeviceVector in MetaInfo and SparsePage (#3446)
    • Simplify DMatrix class (#3395)
    • De-duplicate CPU/GPU code using Transform class (#3643, #3751)
    • Remove obsolete QuantileHistMaker class (#3761)
    • Remove obsolete NoConstraint class (#3792)

    Other Features

    • C++20-compliant Span class for safe pointer indexing (#3548, #3588)
    • Add helper functions to manipulate multiple GPU devices (#3693)
    • XGBoost4J-Spark
      • Allow specifying host ip from the xgboost-tracker.properties file (#3833). This comes in handy when the hosts file doesn't correctly define localhost.

    Usability Improvements

    • Add reference to GitHub repository in pom.xml of JVM packages (#3589)
    • Add R demo of multi-class classification (#3695)
    • Document JSON dump functionality (#3600, #3603)
    • Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
    • Document LambdaMART objectives, both pairwise and listwise (#3672)
    • Document aucpr evaluation metric (#3687)
    • Document gblinear parameters: feature_selector and top_k (#3780)
    • Add instructions for using MinGW-built XGBoost with Python (#3774)
    • Remove nonexistent parameter use_buffer from documentation (#3610)
    • Update Python API doc to include all classes and members (#3619, #3682)
    • Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
    • Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
    • Process data once for Python and CLI examples of learning to rank (#3666)
    • Include full text of Apache 2.0 license in the repository (#3698)
    • Save predictor parameters in model file (#3856)
    • JVM packages
      • Let users specify feature names when calling getModelDump and getFeatureScore (#3733)
      • Warn the user about the lack of over-the-wire encryption (#3667)
      • Fix errors in examples (#3719)
      • Document choice of trackers (#3831)
      • Document that vanilla Apache Spark is required (#3854)
    • Python package
      • Document that custom objective can't contain colon (:) (#3601)
      • Show a better error message for failed library loading (#3690)
      • Document that feature importance is unavailable for non-tree learners (#3765)
      • Document behavior of get_fscore() for zero-importance features (#3763)
      • Recommend pickling as the way to save XGBClassifier / XGBRegressor / XGBRanker (#3829)
    • R package
      • Enlarge variable importance plot to make it more visible (#3820)

    BREAKING CHANGES

    • โฌ†๏ธ External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.

    Known issues

    • Quantile sketcher fails to produce any quantile for some edge cases (#2943)
    • The hist algorithm leaks memory when used with learning rate decay callback (#3579)
    • Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
    • Early stopping doesn't work with gblinear learner (#3789)
    • Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the DMatrix object and re-load.
    • The DMatrix Python objects are initialized with incorrect values when given array slices (#3841)
    • ๐Ÿ‘ The gpu_id parameter is broken and not yet properly supported (#3850)

    Acknowledgement

    Contributors (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)

    First-time Contributors (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH

    First-time Reviewers (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)