xgboost v0.81 Release Notes

Release Date: 2018-11-04
    New feature: feature interaction constraints

    • Users can now control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
    • A tutorial is available, along with R and Python examples.

    New feature: learning to rank using the scikit-learn interface

    • The learning-to-rank task is now available in the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into a scikit-learn pipeline.
    • An example of using the XGBRanker class can be found at demo/rank/rank_sklearn.py.

    New feature: R interface for SHAP interactions

    • SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).

    New feature: GPU predictor now uses multiple GPUs

    • The GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738).

    New feature: Scale distributed XGBoost to large-scale clusters

    • Fix an OS file descriptor limit assertion error on large clusters (#3835, dmlc/rabit#73) by replacing the select()-based AllReduce/Broadcast with a poll()-based implementation.
    • Mitigate the tracker "thundering herd" issue on large clusters by adding exponential backoff retry when workers connect to the tracker.
    • With this change, and some additional tuning, we were able to scale to 1.5k executors on a 12-billion-row dataset.

    New feature: Additional objective functions for GPUs

    • New objective functions ported to GPU: hinge, multi:softmax, multi:softprob, count:poisson, reg:gamma, reg:tweedie.
    • With supported objectives, XGBoost will select the correct devices based on your system and the n_gpus parameter.

    Major bug fix: learning to rank with XGBoost4J-Spark

    • Previously, repartitionForData would shuffle data and lose the ordering necessary for the ranking task.
    • To fix this issue, data points within each RDD partition are now explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).

    Major bug fix: early stopping fixed in XGBoost4J-Spark

    • โšก๏ธ Earlier implementation of early stopping had incorrect semantics and didn't let users to specify direction for optimizing (maximize / minimize)
    • A parameter maximize_evaluation_metrics is defined so as to tell whether a metric should be maximized or minimized as part of early stopping criteria (#3808). Also early stopping now has correct semantics.

    API changes

    • Column sampling by level (colsample_bylevel) is now functional for hist algorithm (#3635, #3862)
    • The gpu: prefix for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
    • Add disable_default_eval_metric parameter to disable the default metric (#3606)
    • Experimental AVX support for gradient computation is removed (#3752)
    • XGBoost4J-Spark
      • Add rank:ndcg and rank:map to supported objectives (#3697)
    • Python package
      • Add callbacks argument to fit() function of scikit-learn API (#3682)
      • Add XGBRanker to scikit-learn interface (#3560, #3848)
      • Add validate_features argument to predict() function of scikit-learn API (#3653)
      • Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
      • Add coef_ and intercept_ as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.

    ๐ŸŽ Performance improvements

    • Address very high GPU memory usage for large data (#3635)
    • Fix performance regression within EvaluateSplits() of gpu_hist algorithm (#3680)

    ๐Ÿ› Bug-fixes

    • Fix a problem in GPU quantile sketch with tiny instance weights (#3628)
    • Fix copy constructor for HostDeviceVectorImpl to prevent dangling pointers (#3657)
    • Fix a bug in partitioned file loading (#3673)
    • Fix an uninitialized pointer in gpu_hist (#3703)
    • Re-share data among GPUs when the number of GPUs changes (#3721)
    • Add back max_delta_step to split evaluation (#3668)
    • Do not round up integer thresholds for integer features in JSON dump (#3717)
    • Use dmlc::TemporaryDirectory to handle temporaries in a cross-platform way (#3783)
    • Fix accuracy problem with gpu_hist when min_child_weight and lambda are set to 0 (#3793)
    • Make sure the tree_method parameter is recognized and not silently ignored (#3849)
    • XGBoost4J-Spark
      • Make sure thresholds are considered when executing predict() method (#3577)
      • Avoid losing precision when computing probabilities by converting to Double early (#3576)
      • getTreeLimit() should return Int (#3602)
      • Fix checkpoint serialization on HDFS (#3614)
      • Throw ControlThrowable instead of InterruptedException so that it is properly re-thrown (#3632)
      • Remove extraneous output to stdout (#3665)
      • Allow specification of task type for custom objectives and evaluations (#3646)
      • Fix distributed updater check (#3739)
      • Fix issue when spark job execution thread cannot return before we execute first() (#3758)
    • Python package
      • Fix accessing DMatrix.handle before it is set (#3599)
      • XGBClassifier.predict() should return margin scores when output_margin is set to true (#3651)
      • Early stopping callback should maximize metric of form NDCG@n- (#3685)
      • Preserve feature names when slicing DMatrix (#3766)
    • R package
      • Replace nround with nrounds to match actual parameter (#3592)
      • Amend xgb.createFolds to handle classes of a single element (#3630)
      • Fix buggy random generator and make colsample_bytree functional (#3781)

    Maintenance: testing, continuous integration, build system

    • Add sanitizer tests to Travis CI (#3557)
    • Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
    • Comply with CRAN submission policy (#3660, #3728)
    • Remove copy-paste error in JVM test suite (#3692)
    • Disable flaky tests in R-package/tests/testthat/test_update.R (#3723)
    • Make Python tests compatible with scikit-learn 0.20 release (#3731)
    • Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
    • Add multi-GPU unit test environment (#3741)
    • Allow plug-ins to be built by CMake (#3752)
    • Test wheel compatibility on CPU containers for pull requests (#3762)
    • Fix broken doc build due to Matplotlib 3.0 release (#3764)
    • Produce xgboost.so for XGBoost-R on Mac OSX, so that make install works (#3767)
    • Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3775, #3776, #3777)
    • Add basic unit tests for gpu_hist algorithm (#3785)
    • Fix Python environment for distributed unit tests (#3806)
    • Test wheels on CUDA 10.0 container for compatibility (#3838)
    • Fix JVM doc build (#3853)

    Maintenance: Refactor C++ code for legibility and maintainability

    • Merge generic device helper functions into GPUSet class (#3626)
    • Re-factor column sampling logic into ColumnSampler class (#3635, #3637)
    • Replace std::vector with HostDeviceVector in MetaInfo and SparsePage (#3446)
    • Simplify DMatrix class (#3395)
    • De-duplicate CPU/GPU code using Transform class (#3643, #3751)
    • Remove obsolete QuantileHistMaker class (#3761)
    • Remove obsolete NoConstraint class (#3792)

    Other Features

    • C++20-compliant Span class for safe pointer indexing (#3548, #3588)
    • Add helper functions to manipulate multiple GPU devices (#3693)
    • XGBoost4J-Spark
      • Allow specifying host ip from the xgboost-tracker.properties file (#3833). This comes in handy when the hosts file doesn't correctly define localhost.

    Usability Improvements

    • Add reference to GitHub repository in pom.xml of JVM packages (#3589)
    • Add R demo of multi-class classification (#3695)
    • Document JSON dump functionality (#3600, #3603)
    • Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
    • Document LambdaMART objectives, both pairwise and listwise (#3672)
    • Document aucpr evaluation metric (#3687)
    • Document gblinear parameters: feature_selector and top_k (#3780)
    • Add instructions for using MinGW-built XGBoost with Python (#3774)
    • Remove nonexistent parameter use_buffer from documentation (#3610)
    • Update Python API doc to include all classes and members (#3619, #3682)
    • Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
    • Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
    • Process data once for Python and CLI examples of learning to rank (#3666)
    • Include full text of Apache 2.0 license in the repository (#3698)
    • Save predictor parameters in model file (#3856)
    • JVM packages
      • Let users specify feature names when calling getModelDump and getFeatureScore (#3733)
      • Warn the user about the lack of over-the-wire encryption (#3667)
      • Fix errors in examples (#3719)
      • Document choice of trackers (#3831)
      • Document that vanilla Apache Spark is required (#3854)
    • Python package
      • Document that custom objective can't contain colon (:) (#3601)
      • Show a better error message for failed library loading (#3690)
      • Document that feature importance is unavailable for non-tree learners (#3765)
      • Document behavior of get_fscore() for zero-importance features (#3763)
      • Recommend pickling as the way to save XGBClassifier / XGBRegressor / XGBRanker (#3829)
    • R package
      • Enlarge variable importance plot to make it more visible (#3820)

    BREAKING CHANGES

    • โฌ†๏ธ External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.

    Known issues

    • Quantile sketcher fails to produce any quantile for some edge cases (#2943)
    • The hist algorithm leaks memory when used with learning rate decay callback (#3579)
    • Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
    • Early stopping doesn't work with gblinear learner (#3789)
    • Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the DMatrix object and re-load.
    • The DMatrix Python objects are initialized with incorrect values when given array slices (#3841)
    • ๐Ÿ‘ The gpu_id parameter is broken and not yet properly supported (#3850)

    Acknowledgement

    Contributors (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)

    First-time Contributors (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH

    First-time Reviewers (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)