Self-host a multilabel archive

Cometa is free, open-source software, so you can self-host your own multilabel archive. If you have a family of multilabel datasets you want to serve yourself, Cometa handles the partitioning, metadata extraction, and web repository for you.

Preparing your datasets

Cometa uses the mldr software to extract metadata and partition datasets, so the first step is to install the following R packages and import your datasets:

install.packages(c("mldr", "mldr.datasets"))
library(mldr)

# This command reads mydataset.arff and mydataset.xml
mydataset = mldr("mydataset")

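# Save your dataset in an R-friendly format
# (create the target folder first; saveRDS does not create directories)
dir.create("public/full", recursive = TRUE, showWarnings = FALSE)
saveRDS(mydataset, file = "public/full/mydataset.rds")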

Please refer to the mldr::mldr documentation for other ways to import your data.

Running Cometa

Cometa is easy to self-host thanks to its Docker image. First, install Docker by following the instructions on the official site. Then start Cometa in automatic mode with the following command in your terminal:

mkdir -p public
docker run -dp 8080:8080 \
  --volume "$(pwd)/public:/usr/app/public" \
  --name cometa1 \
  fdavidcl/cometa:latest

This runs the container in detached mode, so you will not see Cometa's log output or its interactive prompt in your terminal.
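
Because a detached container prints nothing to your terminal, it can help to have a quick way to inspect and stop it. The following is a minimal sketch using the standard docker logs, docker stop and docker rm commands; the helper names cometa_logs and cometa_stop are hypothetical and not part of Cometa, and the default container name is the cometa1 used in the command above.

```shell
# Hypothetical helpers for a detached Cometa container (not shipped with Cometa).
# The default name "cometa1" matches the --name used in the docker run command.

# Print the last N log lines (default 50) of the container.
cometa_logs() {
  docker logs --tail "${1:-50}" "${2:-cometa1}"
}

# Stop the container and remove it so that the name can be reused.
cometa_stop() {
  docker stop "${1:-cometa1}" && docker rm "${1:-cometa1}"
}
```

For example, cometa_logs 100 shows the last 100 lines, and docker logs -f cometa1 follows the log live. Removing the container leaves the public folder on the host untouched, since it is a bind-mounted directory.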

The web repository is now served on port 8080 (visit localhost:8080 in your browser to see it). Cometa watches the folder public/pending for new datasets in mldr (.rds) format. Each incoming dataset is partitioned and its metadata is extracted so that it can be displayed in the repository. Partitions are created inside public/partitions, and the full dataset is converted to several formats in public/full.
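
Queueing a dataset therefore amounts to placing an .rds file in public/pending. The sketch below shows one way to do that from the host; cometa_submit is a hypothetical helper name, and the copy-then-rename step is a precaution that assumes Cometa might otherwise pick up a partially written file (the source does not say how it detects new files).

```shell
# Hypothetical helper (not part of Cometa): queue an .rds file for processing
# by placing it in <public_dir>/pending. public_dir defaults to ./public.
cometa_submit() {
  src="$1"
  public_dir="${2:-public}"

  # Only accept files saved in mldr (.rds) format.
  case "$src" in
    *.rds) ;;
    *) echo "expected an .rds file: $src" >&2; return 1 ;;
  esac
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }

  mkdir -p "$public_dir/pending"

  # Copy outside the watched folder first, then rename into place, so that
  # the pending folder never contains a half-copied file (an assumption
  # about how Cometa watches the folder, kept here as a precaution).
  tmp="$public_dir/.$(basename "$src").tmp"
  cp "$src" "$tmp" && mv "$tmp" "$public_dir/pending/$(basename "$src")"
}
```

Usage: cometa_submit mydataset.rds, run from the directory that contains public. The results should then appear under public/partitions and public/full once processing finishes.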

Running Cometa in interactive mode

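If you do not want Cometa to run in automatic mode, you can launch it interactively by replacing -dp with -itp in the previous docker run command. If a container named cometa1 still exists from the previous step, remove it or pick a different name first, because Docker rejects duplicate container names. Add --rm as well to discard the container when you have finished using it: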

mkdir -p public
docker run --rm -itp 8080:8080 \
  --volume "$(pwd)/public:/usr/app/public" \
  --name cometa1 \
  fdavidcl/cometa:latest

Wait for the image download to complete; you should then see a welcome message and a menu of options:

  1. Automatic: default behavior in non-interactive mode.
  2. Only partitioning: partitions all datasets in public/pending and exits the program.
  3. Only serve website: launches a web server for the included dataset repository.
  4. Drop to a terminal: advanced mode.
  5. Quit

The second option applies several partitioning and cross-validation strategies to your data. Bear in mind that this can take anywhere from a few minutes to several hours, depending on the size of the datasets.