Abstract
Background. Associative Classification, which combines two important and distinct fields (classification and association rule mining), aims to build accurate and interpretable classifiers by means of association rules. A major problem in this field is that existing proposals do not scale well when truly Big Data are considered. The aim of this work is therefore to adapt well-known associative classification algorithms (CBA and CPAR) to different Big Data platforms (Spark and Flink). CBA was selected from the state of the art because, in an experimental study, it obtained the best trade-off between interpretability and accuracy. Because the exhaustive search of CBA can make it slow to run, CPAR was selected as a second approach: it also obtains very accurate classifiers, in less time, by means of a fast greedy algorithm.
Results. An experimental study was performed on 30 datasets and the results were analyzed by means of non-parametric tests. The results showed that CBA-Spark and CBA-Flink obtained highly interpretable classifiers but were more time consuming than CPAR-Spark and CPAR-Flink. The study also demonstrated that the proposals were able to run on truly Big Data (file sizes up to 200 GBytes). Finally, the analysis of different quality metrics revealed no statistical difference between the two approaches.
Conclusions. The experimental study revealed that sequential algorithms cannot be used on large quantities of data, so approaches such as CBA-Spark, CBA-Flink, CPAR-Spark or CPAR-Flink are required. CBA proved to be very useful when the main goal is to obtain highly interpretable results. However, when the runtime has to be minimized, CPAR should be used. No statistical difference could be found between the two proposals in terms of the quality of the results, except for the interpretability of the final classifiers, where CBA was statistically better than CPAR.
Additional Information
Results for the accuracy measure:
Datasets | CBA | CBA2 | CMAR | CPAR | C45 | Ripper | Core | OneR |
---|---|---|---|---|---|---|---|---|
appendicitis | 0.896 | 0.886 | 0.868 | 0.869 | 0.867 | 0.82 | 0.877 | 0.82 |
australian | 0.866 | 0.868 | 0.871 | 0.864 | 0.864 | 0.81 | 0.832 | 0.855 |
banana | 0.57 | 0.612 | 0.019 | 0.752 | 0.752 | 0.643 | 0.631 | 0.702 |
breast | 0.653 | 0.737 | 0.643 | 0.737 | 0.693 | 0.617 | 0.754 | 0.69 |
cleveland | 0.537 | 0.539 | 0.485 | 0.56 | 0.556 | 0.441 | 0.512 | 0.543 |
contraceptive | 0.448 | 0.461 | 0.1 | 0.559 | 0.546 | 0.523 | 0.434 | 0.409 |
flare | 0.67 | 0.675 | 0.446 | 0.754 | 0.703 | 0.672 | 0.646 | 0.614 |
german | 0.753 | 0.737 | 0.642 | 0.741 | 0.694 | 0.667 | 0.698 | 0.717 |
hayes-roth | 0.537 | 0.538 | 0.194 | 0.538 | 0.538 | 0.85 | 0.481 | 0.388 |
heart | 0.83 | 0.826 | 0.837 | 0.833 | 0.807 | 0.693 | 0.7 | 0.715 |
iris | 0.933 | 0.953 | 0.94 | 0.947 | 0.96 | 0.947 | 0.947 | 0.94 |
lymphography | 0.769 | 0.79 | 0.776 | 0.782 | 0.751 | 0.798 | 0.688 | 0.75 |
magic | 0.815 | 0.774 | 0.449 | 0.853 | 0.773 | 0.844 | 0.744 | 0.735 |
mammographic | 0.82 | 0.819 | 0.763 | 0.84 | 0.836 | 0.795 | 0.79 | 0.828 |
monk-2 | 0.97 | 0.972 | 0.972 | 0.972 | 0.972 | 1 | 0.913 | 0.806 |
mushroom | 0.99 | 1 | 0.998 | 0.994 | 1 | 1 | 0.799 | 0.984 |
page-blocks | 0.97 | 0.962 | 0.897 | 0.96 | 0.943 | 0.962 | 0.903 | 0.936 |
phoneme | 0.805 | 0.751 | 0.444 | 0.823 | 0.799 | 0.822 | 0.754 | 0.766 |
pima | 0.727 | 0.751 | 0.521 | 0.772 | 0.764 | 0.701 | 0.742 | 0.747 |
post-operative | 0.556 | 0.59 | 0.589 | 0.622 | 0.6 | 0.4 | 0.69 | 0.703 |
saheart | 0.67 | 0.697 | 0.35 | 0.727 | 0.695 | 0.601 | 0.707 | 0.692 |
spectfheart | 0.798 | 0.854 | 0.776 | 0.836 | 0.836 | 0.76 | 0.795 | 0.795 |
splice | 0.938 | 0.73 | 0.887 | 0.947 | 0.905 | 0.928 | 0.519 | 0.613 |
tae | 0.25 | 0.252 | 0 | 0 | 0 | 0.595 | 0.437 | 0.252 |
tic-tac-toe | 1 | 1 | 0.975 | 0.978 | 0.865 | 0.977 | 0.689 | 0.7 |
titanic | 0.74 | 0.741 | 0.067 | 0.776 | 0.776 | 0.705 | 0.783 | 0.776 |
vehicle | 0.69 | 0.695 | 0.52 | 0.702 | 0.637 | 0.715 | 0.375 | 0.551 |
wine | 0.938 | 0.989 | 0.983 | 0.989 | 0.961 | 0.91 | 0.944 | 0.826 |
winequality-white | 0.45 | 0.45 | 0 | 0.555 | 0.486 | 0.538 | 0.446 | 0.484 |
wisconsin | 0.96 | 0.965 | 0.963 | 0.956 | 0.956 | 0.966 | 0.934 | 0.927 |
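The abstract mentions non-parametric tests without naming them. Rank-based tests such as Friedman's start from the rank of each algorithm on each dataset, so a minimal sketch of that first step is given here, assuming higher accuracy is better and ties share the average rank. The function names are hypothetical, and only three rows of the accuracy table are copied in for illustration.

```python
from statistics import mean

# Three rows copied from the accuracy table (columns in table order).
ALGORITHMS = ["CBA", "CBA2", "CMAR", "CPAR", "C45", "Ripper", "Core", "OneR"]
ACCURACY = {
    "appendicitis": [0.896, 0.886, 0.868, 0.869, 0.867, 0.82, 0.877, 0.82],
    "australian": [0.866, 0.868, 0.871, 0.864, 0.864, 0.81, 0.832, 0.855],
    "banana": [0.57, 0.612, 0.019, 0.752, 0.752, 0.643, 0.631, 0.702],
}


def ranks_descending(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """Average the per-dataset ranks of each algorithm (one column per algorithm)."""
    per_dataset = [ranks_descending(row) for row in table.values()]
    return [mean(column) for column in zip(*per_dataset)]


if __name__ == "__main__":
    for name, rank in zip(ALGORITHMS, average_ranks(ACCURACY)):
        print(f"{name}: {rank:.2f}")
```

A lower average rank means the algorithm tended to place higher across the datasets considered; a full analysis would use all 30 rows and feed the ranks into the chosen test.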
Results for the kappa measure:
Datasets | CBA | CBA2 | CMAR | CPAR | C45 | Ripper | Core | OneR |
---|---|---|---|---|---|---|---|---|
appendicitis | 0.54 | 0.547 | 0.516 | 0.481 | 0.452 | 0.461 | 0.58 | 0.447 |
australian | 0.72 | 0.727 | 0.738 | 0.72 | 0.732 | 0.622 | 0.661 | 0.709 |
banana | 0.045 | 0.147 | 0.01 | 0.498 | 0.499 | 0.328 | 0.219 | 0.383 |
breast | 0.65 | 0.289 | 0.167 | 0.205 | 0.242 | 0.171 | 0.295 | 0.163 |
cleveland | 0.28 | 0.291 | 0.238 | 0.219 | 0.208 | 0.206 | 0.138 | 0.17 |
contraceptive | 0.127 | 0.159 | 0.056 | 0.302 | 0.317 | 0.272 | 0.074 | 0.066 |
flare | 0.56 | 0.577 | 0.373 | 0.679 | 0.672 | 0.588 | 0.545 | 0.493 |
german | 0.244 | 0.33 | 0.105 | 0.264 | 0.294 | 0.243 | 0.005 | 0.152 |
hayes-roth | 0.28 | 0.284 | 0.152 | 0.284 | 0.284 | 0.732 | 0.193 | 0.056 |
heart | 0.66 | 0.641 | 0.667 | 0.656 | 0.615 | 0.408 | 0.375 | 0.421 |
iris | 0.91 | 0.928 | 0.902 | 0.916 | 0.892 | 0.91 | 0.913 | 0.902 |
lymphography | 0.55 | 0.595 | 0.571 | 0.563 | 0.505 | 0.613 | 0.331 | 0.515 |
magic | 0.24 | 0.428 | 0.213 | 0.667 | 0.644 | 0.647 | 0.432 | 0.424 |
mammographic | 0.63 | 0.636 | 0.582 | 0.677 | 0.693 | 0.589 | 0.576 | 0.652 |
monk-2 | 0.944 | 0.945 | 0.945 | 0.945 | 0.945 | 1 | 0.823 | 0.602 |
mushroom | 0.99 | 1 | 0.995 | 0.988 | 1 | 1 | 0.524 | 0.967 |
page-blocks | 0.57 | 0.778 | 0.021 | 0.801 | 0.82 | 0.811 | 0.115 | 0.64 |
phoneme | 0.21 | 0.49 | 0.166 | 0.572 | 0.548 | 0.605 | 0.417 | 0.449 |
pima | 0.46 | 0.48 | 0.232 | 0.462 | 0.483 | 0.393 | 0.36 | 0.357 |
post-operative | -0.17 | -0.161 | -0.097 | -0.148 | -0.029 | -0.221 | -0.043 | -0.029 |
saheart | 0.1 | 0.163 | 0.12 | 0.293 | 0.333 | 0.187 | 0.25 | 0.21 |
spectfheart | 0.2 | 0.565 | 0.229 | 0.343 | 0.411 | 0.39 | 0 | 0 |
splice | 0.89 | 0.552 | 0.822 | 0.914 | 0.902 | 0.882 | 0.001 | 0.377 |
tae | 0.25 | 0 | 0 | 0 | 0 | 0.341 | 0.174 | 0 |
tic-tac-toe | 1 | 1 | 0.945 | 0.952 | 0.706 | 0.949 | 0.313 | 0.34 |
titanic | 0.25 | 0.255 | 0.045 | 0.433 | 0.433 | 0.373 | 0.455 | 0.433 |
vehicle | 0.69 | 0.59 | 0.415 | 0.604 | 0.597 | 0.615 | 0.17 | 0.4 |
wine | 0.98 | 0.982 | 0.973 | 0.982 | 0.866 | 0.86 | 0.91 | 0.726 |
winequality-white | 0 | 0.009 | 0 | 0.289 | 0.313 | 0.319 | 0.04 | 0.162 |
wisconsin | 0.91 | 0.92 | 0.917 | 0.903 | 0.915 | 0.927 | 0.854 | 0.841 |
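Several rows above pair a reasonable accuracy with a kappa near zero. The sketch below shows how the two measures relate, assuming kappa here means Cohen's kappa computed from true and predicted labels (the source does not spell out the formula). The function names are illustrative, not taken from the project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        return 0.0  # convention chosen for this sketch: a single class on both sides
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # A classifier that always predicts the majority class.
    y_true = ["a", "a", "a", "b"]
    y_pred = ["a", "a", "a", "a"]
    print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

In the toy example the majority-class predictor reaches 0.75 accuracy but a kappa of 0, which is the pattern that a kappa of 0 next to a high accuracy usually indicates.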
Results for complexity:
Datasets | CBA | CBA2 | CPAR | C45 | Ripper |
---|---|---|---|---|---|
appendicitis | 18.11 | 11.41 | 33.61 | 8 | 46.2 |
australian | 343.63 | 327.93 | 670.96 | 444 | 1439.2 |
banana | 0 | 8.28 | 370.25 | 594 | 297.6 |
breast | 131.79 | 159.53 | 559.67 | 504 | 2282 |
cleveland | 81.97 | 131.05 | 322.05 | 440 | 3411.2 |
contraceptive | 22.25 | 45.44 | 1146.82 | 1260 | 18631.9 |
flare | 3 | 65.82 | 1233.16 | 2349 | 44761.5 |
german | 754.64 | 968.18 | 2734.47 | 3844 | 3630 |
hayes-roth | 4 | 4 | 58.09 | 24 | 159.6 |
heart | 164.21 | 148.44 | 307.26 | 341 | 294 |
iris | 13.7 | 5.68 | 49.95 | 9 | 10.2 |
lymphography | 119.84 | 94.81 | 256.86 | 240 | 192 |
magic | 760.81 | 1195.78 | 28644.7 | 784700 | 20997.9 |
mammographic | 44.83 | 38.05 | 263.56 | 50 | 2373.3 |
monk-2 | 4 | 4 | 56.14 | 12 | 8 |
mushroom | 34.13 | 53.97 | 132.72 | 630 | 6 |
page-blocks | 525.47 | 708.7 | 1438.25 | 17325 | 5611.2 |
phoneme | 20.92 | 77.04 | 4225.44 | 24864 | 8522.5 |
pima | 92.1 | 96.94 | 557.19 | 308 | 1800 |
post-operative | 49.2 | 87.99 | 111.34 | 32 | 832.5 |
saheart | 28.64 | 21.96 | 232.78 | 60 | 1236.6 |
spectfheart | 0 | 67.03 | 287.86 | 286 | 155.2 |
splice | 597.58 | 198 | 1964.57 | 69185 | 5959.5 |
tae | 0 | 0 | 0 | 0 | 1320 |
tic-tac-toe | 27 | 27 | 490.03 | 8700 | 679.4 |
titanic | 4.2 | 4.2 | 117.81 | 6 | 103.2 |
vehicle | 241.45 | 193.29 | 2173.12 | 9898 | 4429.6 |
wine | 15.16 | 15.16 | 79.25 | 91 | 26 |
winequality-white | 0 | 75.58 | 16805.92 | 80375 | 115628.8 |
wisconsin | 123.48 | 87.26 | 157.08 | 220 | 117.6 |
Results for time:
Datasets | CBA | CBA2 |
---|---|---|
appendicitis | 623.66 | 1265.29 |
australian | 35850.5 | 39850.18 |
banana | 1563 | 1658.86 |
breast | 1300.5 | 1552.12 |
cleveland | 1340.61 | 1489.87 |
contraceptive | 1816.87 | 1919.75 |
flare | 3892.43 | 37613.55 |
german | 229999 | 164956.76 |
hayes-roth | 561.41 | 578.84 |
heart | 2500 | 2625.54 |
iris | 3789 | 49144 |
lymphography | 9260.17 | 9907.56 |
magic | 13716.98 | 108535.81 |
mammographic | 15897 | 11971.96 |
monk-2 | 9000 | 9980.83 |
mushroom | 52899 | 28932.19 |
page-blocks | 75585 | 108800 |
phoneme | 3295.25 | 4543.03 |
pima | 1276.21 | 1336.04 |
post-operative | 954.12 | 1730.27 |
saheart | 957.54 | 983.08 |
spectfheart | 798759 | 806434 |
splice | 87800 | 15584.62 |
tae | 900 | 1100 |
tic-tac-toe | 1513.13 | 2115.06 |
titanic | 3563.22 | 2012.97 |
vehicle | 204390.61 | 2001850.57 |
wine | 15509.69 | 159143.47 |
winequality-white | 4497.72 | 25046.86 |
wisconsin | 1812.8 | 1949.86 |
Algorithms
CBA
Pseudocode for this algorithm can be found in this link. Pseudocode for rule generation:
Rule generation, step 1:
Rule generation, step 2:
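As a rough illustration of what rule generation in CBA produces, here is a minimal single-machine sketch. It is not the Spark or Flink implementation from this project: it enumerates antecedents by brute force instead of the level-wise candidate pruning used by CBA, and the names `generate_cars` and `classify`, the thresholds, and the toy data are all assumptions made for the example. Rules are kept when they meet minimum support and confidence, then ordered by the usual CBA precedence (higher confidence, then higher support, then earlier generation).

```python
from collections import Counter
from itertools import combinations


def generate_cars(rows, labels, min_support=0.2, min_confidence=0.6, max_len=2):
    """Return class association rules as (antecedent, label, support, confidence)."""
    n = len(rows)
    rules = []
    for size in range(1, max_len + 1):
        cover = Counter()  # antecedent -> rows containing it
        joint = Counter()  # (antecedent, label) -> rows containing it with that label
        for row, label in zip(rows, labels):
            items = sorted(row.items())
            for antecedent in combinations(items, size):
                cover[antecedent] += 1
                joint[(antecedent, label)] += 1
        for (antecedent, label), hits in sorted(joint.items()):
            support = hits / n
            confidence = hits / cover[antecedent]
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, label, support, confidence))
    # Stable sort: ties on confidence and support keep their generation order.
    rules.sort(key=lambda rule: (-rule[3], -rule[2]))
    return rules


def classify(rules, row, default):
    """Predict with the first rule (in precedence order) whose antecedent matches."""
    items = set(row.items())
    for antecedent, label, _support, _confidence in rules:
        if set(antecedent) <= items:
            return label
    return default


if __name__ == "__main__":
    rows = [
        {"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"},
    ]
    labels = ["play", "play", "stay", "stay"]
    rules = generate_cars(rows, labels)
    for rule in rules:
        print(rule)
    print(classify(rules, {"outlook": "sunny", "windy": "no"}, default="stay"))
```

The full algorithm also prunes the sorted rule list against the training data before using it as a classifier; that step is left out here to keep the sketch short.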
CPAR
Pseudocode for this algorithm can be found in this link.
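The greedy search in CPAR is usually described in terms of FOIL gain (to pick the next literal for a rule) and the Laplace expected accuracy (to score finished rules). The sketch below shows both formulas in isolation, as they are commonly stated for CPAR. The function and parameter names are hypothetical and this is not the distributed code from this project.

```python
import math


def foil_gain(pos, neg, pos_after, neg_after):
    """FOIL gain of adding a literal to the current rule.

    pos, neg: positive/negative examples covered before adding the literal
    (pos must be greater than zero).
    pos_after, neg_after: examples still covered after adding it.
    """
    if pos_after == 0:
        return 0.0  # the literal removes every positive example
    before = math.log(pos / (pos + neg))
    after = math.log(pos_after / (pos_after + neg_after))
    return pos_after * (after - before)


def laplace_accuracy(class_hits, total_covered, n_classes):
    """Laplace expected accuracy of a rule: (n_c + 1) / (n_total + k)."""
    return (class_hits + 1) / (total_covered + n_classes)


def best_literal(pos, neg, candidates):
    """Pick the candidate literal with the highest FOIL gain.

    candidates maps a literal name to its (pos_after, neg_after) counts.
    """
    return max(candidates, key=lambda lit: foil_gain(pos, neg, *candidates[lit]))


if __name__ == "__main__":
    candidates = {"outlook=sunny": (5, 0), "windy=no": (6, 5)}
    print(best_literal(10, 10, candidates))
    print(laplace_accuracy(class_hits=9, total_covered=10, n_classes=2))
```

Choosing the literal greedily by gain, rather than enumerating all frequent rule items as CBA does, is what makes this style of search faster at the cost of exploring fewer rules.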