Title: | Space Filling Based Tools for Data Mining |
---|---|
Description: | Space-filling-based tools for machine learning and data mining. Some functions offer several computational techniques and handle out-of-memory problems with large data sets by using the ff package. |
Authors: | Mohamed Laib and Mikhail Kanevski |
Maintainer: | Mohamed Laib <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-10-31 16:32:50 UTC |
Source: | https://github.com/mlaib/sftools |
M. Laib, M. Kanevski, A novel filter algorithm for unsupervised feature selection based on a space filling measure. Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pp. 485-490, Bruges (Belgium), 2018.
M. Laib and M. Kanevski, A new algorithm for redundancy minimisation in geo-environmental data, 2019. Computers & Geosciences, 133 104328.
J. A. Royle, D. Nychka, An algorithm for the construction of spatial coverage designs with implementation in Splus, Computers and Geosciences 24 (1997) p. 479–488.
J. Franco, Planification d’expériences numériques en phase exploratoire pour la simulation des phénomènes complexes, Thesis (2008) 282.
D. Dupuy, C. Helbert, J. Franco (2015). DiceDesign and DiceEval: Two R Packages for Design and Analysis of Computer Experiments. Journal of Statistical Software, 65(11), 1-38. Jstatsoft.
Useful links:
Report bugs at https://github.com/mlaib/SFtools/issues
Generates a simulated data set
SimData(n=1000)
n |
Number of generated data points (by default: 1000) |
A data.frame holding the simulated data set; some of its features are redundant.
Sim_Data <- SimData(n = 1000)
plot(Sim_Data$x1, Sim_Data$x2)

## Not run: 
#### Visualisation of the data set (3D) ####
require(rgl)
require(colorRamps)
# Colour each point by its z value, binned into 100 classes
bins <- cut(Sim_Data$z, breaks = 100)
cols <- matlab.like(100)[as.numeric(bins)]
plot3d(Sim_Data$x1, Sim_Data$x2, Sim_Data$z, radius = 0.01, col = cols,
       type = "s", xlab = "x1", ylab = "x2", zlab = "z", box = FALSE)
grid3d(c("x", "y", "z"), col = "black", lwd = 1)
## End(Not run)
Applies the UfsCov algorithm, based on the space-filling concept, using a sequential forward search (SFS).
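To make the search procedure concrete, here is a minimal sketch of a sequential forward search driven by a coverage-type criterion. It is not the package's implementation. The helper names `coverage_measure` and `sfs_coverage` are hypothetical, and both the criterion (the coefficient of variation of nearest-neighbour distances, in the spirit of the coverage measure discussed in the DiceDesign reference) and the min-max rescaling are assumptions made for illustration.

```r
# Hypothetical helper: coverage-type criterion computed from nearest-neighbour
# distances (population standard deviation divided by the mean).
coverage_measure <- function(x) {
  d <- as.matrix(dist(x))        # pairwise Euclidean distances
  diag(d) <- Inf                 # ignore zero self-distances
  nn <- apply(d, 1, min)         # nearest-neighbour distance of each point
  sqrt(mean((nn - mean(nn))^2)) / mean(nn)
}

# Hypothetical helper: greedy forward search that, at each step, adds the
# feature giving the smallest criterion value.
sfs_coverage <- function(data) {
  x <- as.matrix(data)
  # Assumption: rescale every feature to [0, 1] so distances are comparable.
  x <- apply(x, 2, function(v) (v - min(v)) / (max(v) - min(v)))
  remaining <- seq_len(ncol(x))
  selected <- integer(0)
  cov_path <- numeric(0)
  while (length(remaining) > 0) {
    scores <- vapply(
      remaining,
      function(j) coverage_measure(x[, c(selected, j), drop = FALSE]),
      numeric(1)
    )
    best <- which.min(scores)
    selected <- c(selected, remaining[best])
    cov_path <- c(cov_path, scores[best])
    remaining <- remaining[-best]
  }
  # Element names mirror the documented return value of UfsCov.
  list(CovD = cov_path, IdR = selected)
}

set.seed(1)
toy <- data.frame(a = runif(200), b = runif(200))
toy$c <- toy$a + toy$b           # deliberately redundant feature
res <- sfs_coverage(toy)
res$IdR                          # order in which the features were added
res$CovD                         # criterion value after each addition
```

Every step recomputes the full set of pairwise distances for each candidate feature, which is why the cost grows quickly with the number of points.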
UfsCov(data)
data |
Data of class: |
Because the algorithm relies on pairwise distances, a large number of data points can take a long time to process and require a lot of memory, depending on the computing power of your machine.
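As a rough guide to the memory cost, the sketch below estimates the size of one set of pairwise distances for n points. It assumes the distances are stored as 8-byte doubles in the compact lower-triangle form used by `stats::dist`. How UfsCov stores its distances internally is not stated here, so the figures are only an order-of-magnitude indication. The helper name `dist_memory_mb` is hypothetical.

```r
# Hypothetical helper: approximate size (in MiB) of the n * (n - 1) / 2
# pairwise distances between n points, at 8 bytes per value.
dist_memory_mb <- function(n) {
  n_pairs <- n * (n - 1) / 2     # number of distinct point pairs
  n_pairs * 8 / 1024^2           # bytes converted to MiB
}

sizes <- c(1000, 10000, 50000)
data.frame(n = sizes, approx_MiB = round(dist_memory_mb(sizes), 1))
```

The requirement grows quadratically: ten times more points need roughly a hundred times more memory.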
A list of two elements:
CovD: a vector containing the coverage measure at each step of the SFS.
IdR: a vector containing the variables added during the selection procedure.
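The two elements can be combined to read off a feature subset, as sketched below. The `res` list is a hand-made stand-in with the same structure as the documented return value; its numbers and the feature names are invented for illustration and are not output of UfsCov. Keeping the features added up to the minimum of CovD is an assumption suggested by the `which.min` call in the examples.

```r
# Stand-in for a UfsCov result; all values are illustrative only.
feature_names <- c("f1", "f2", "f3", "f4", "f5")
res <- list(
  CovD = c(0.90, 0.45, 0.30, 0.52, 0.61),  # coverage measure per SFS step
  IdR  = c(3, 1, 5, 2, 4)                  # column index added at each step
)

added_order <- feature_names[res$IdR]       # features in order of addition
best_step   <- which.min(res$CovD)          # step with the smallest coverage
kept        <- added_order[seq_len(best_step)]

added_order
kept
```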
The algorithm does not handle missing values or constant features; remove them before calling the function.
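One possible way to prepare a data set accordingly is sketched below in base R. The helper name `clean_for_ufs` is hypothetical and not part of the package, and the sketch assumes the input is a data.frame. It drops every row that contains a missing value, then every column that takes a single value.

```r
# Hypothetical helper: drop incomplete rows, then constant columns.
clean_for_ufs <- function(data) {
  data <- data[complete.cases(data), , drop = FALSE]   # rows without NA
  is_constant <- vapply(data, function(v) length(unique(v)) <= 1, logical(1))
  data[, !is_constant, drop = FALSE]
}

raw <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(5, 5, 5, 5),             # constant feature
  c = c(0.1, 0.7, 0.3, 0.9)
)
clean <- clean_for_ufs(raw)
dim(clean)                       # 3 rows, 2 columns (a and c)
```

Rows are removed before columns so that a feature which becomes constant once incomplete rows are gone is also dropped.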
Mohamed Laib [email protected]
M. Laib, M. Kanevski, A novel filter algorithm for unsupervised feature selection based on a space filling measure. Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pp. 485-490, Bruges (Belgium), 2018.
M. Laib and M. Kanevski, A new algorithm for redundancy minimisation in geo-environmental data, 2019. Computers & Geosciences, 133 104328.
Sim_Data <- SimData(n = 800)
Results <- UfsCov(Sim_Data)

# Plot the coverage measure against the features in the order they were added
cou <- colnames(Sim_Data)
nom <- cou[Results[[2]]]
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2))
names(Results[[1]]) <- cou[Results[[2]]]
plot(Results[[1]], pch = 16, cex = 1, col = "blue", axes = FALSE,
     xlab = "Added Features", ylab = "Coverage measure")
lines(Results[[1]], cex = 2, col = "blue")
grid(lwd = 1.5, col = "gray")
box()
axis(2)
axis(1, seq_along(nom), nom)

# Step at which the coverage measure is smallest
which.min(Results[[1]])

## Not run: 
#### UfsCov on the Butterfly dataset ####
require(IDmining)
N <- 1000
raw_dat <- Butterfly(N)
dat <- raw_dat[, -9]
Results <- UfsCov(dat)

cou <- colnames(dat)
nom <- cou[Results[[2]]]
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2))
names(Results[[1]]) <- cou[Results[[2]]]
plot(Results[[1]], pch = 16, cex = 1, col = "blue", axes = FALSE,
     xlab = "Added Features", ylab = "Coverage measure")
lines(Results[[1]], cex = 2, col = "blue")
grid(lwd = 1.5, col = "gray")
box()
axis(2)
axis(1, seq_along(nom), nom)

which.min(Results[[1]])
## End(Not run)
Applies the UfsCov algorithm, based on the space-filling concept, using a sequential forward search (SFS). This function supports parallel computing.
UfsCov_par(data, ncores=2)
data |
Data of class: |
ncores |
Number of cores to use (by default: 2) |
Because the algorithm relies on pairwise distances, a large number of data points requires more memory, depending on the computing power of your machine.
A list of two elements:
CovD: a vector containing the coverage measure at each step of the SFS.
IdR: a vector containing the variables added during the selection procedure.
The algorithm does not handle missing values or constant features; remove them before calling the function. Using this function on small data sets is not recommended, as it takes more time than the standard UfsCov function.
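One way to act on this advice is to choose the number of cores from the size of the data, as in the sketch below. The helper name `pick_ncores` is hypothetical, and the threshold of 2000 rows is an arbitrary placeholder rather than a value taken from the package. The break-even point depends on the machine, so timing both functions with `system.time` remains the reliable check.

```r
library(parallel)

# Hypothetical helper: use one core for small inputs, otherwise leave one
# core free for the rest of the system.
pick_ncores <- function(n_rows, threshold = 2000) {
  available <- detectCores()     # may be NA on some platforms
  if (is.na(available) || n_rows < threshold) return(1L)
  max(1L, available - 1L)
}

pick_ncores(500)     # small data: 1, i.e. prefer the serial UfsCov
pick_ncores(50000)   # large data: a value to pass as ncores to UfsCov_par
```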
Mohamed Laib [email protected]
M. Laib, M. Kanevski, A novel filter algorithm for unsupervised feature selection based on a space filling measure. Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pp. 485-490, Bruges (Belgium), 2018.
M. Laib and M. Kanevski, A new algorithm for redundancy minimisation in geo-environmental data, 2019. Computers & Geosciences, 133 104328.
N <- 800
dat <- SimData(N)
Results <- UfsCov_par(dat, ncores = 2)

# Plot the coverage measure against the features in the order they were added
cou <- colnames(dat)
nom <- cou[Results[[2]]]
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2))
names(Results[[1]]) <- cou[Results[[2]]]
plot(Results[[1]], pch = 16, cex = 1, col = "blue", axes = FALSE,
     xlab = "Added Features", ylab = "Coverage measure")
lines(Results[[1]], cex = 2, col = "blue")
grid(lwd = 1.5, col = "gray")
box()
axis(2)
axis(1, seq_along(nom), nom)

which.min(Results[[1]])

## Not run: 
N <- 5000
dat <- SimData(N)

## Little comparison:
system.time(Uf <- UfsCov(dat))
system.time(Uf.p <- UfsCov_par(dat, ncores = 4))
## End(Not run)