Artifact Packaging and Importing Best Practices and Considerations

The SEARCCH hub catalogs metadata about artifacts; it does not store the artifacts themselves. Before you can import any artifact into SEARCCH, it must first be published on the Internet at a valid, accessible URL. Artifacts published on GitHub, the ACM Digital Library, IEEE Xplore, USENIX conference proceedings, arXiv, Papers with Code, Zenodo, and HTTPS-accessible git repositories can be imported using the assisted artifact import function.

Packaging and publication needs differ by artifact type. How you package your artifact is ultimately up to you; the considerations below can make it easier for others to review and reuse your artifact.

Publications and Presentations

Publications and presentations usually need no packaging since they often consist of a single file. These are typically published on a conference, journal, or organization web site. We recommend using a published document's official DOI link, if it has one, as the primary artifact URL.

Publications and presentations are sometimes open access and sometimes behind a paywall (e.g., many ACM publications). If the official copy is paywalled, you can add a File URL as metadata that points to an open-access copy of the paper or presentation on a personal or other open website.

Software and Datasets

Packaging

Software and data artifacts should be shared in the form in which they are most useful. Each artifact will have its own unique form and sub-components. In general, the more complete the artifact package is, including accurate and complete documentation, the more likely that others can reuse your artifact without your assistance.

Software packages should include, at a minimum, source files, configuration files, a dependency list (or copies of dependencies where appropriate), and documentation. Documentation should cover assumptions, dependencies, how to install, how to use, any applicable licenses, search keywords, and a contact for questions.

Software stored in git repositories should include a README.md file, preferably using the structure in one of our provided templates [INSERT LINK HERE]. Our automated artifact importer tool reads and parses any README.md file, so the more closely it matches one of our templates, the better job the importer will do. 
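
As an illustration only (the provided templates, once linked, are authoritative), a software README.md might be organized along these lines, with each section filled in for your artifact:

    # <Artifact name>
    ## Overview        - what the software does and the publication it supports
    ## Dependencies    - required libraries and versions, plus any assumptions
    ## Installation    - step-by-step install instructions
    ## Usage           - how to run the software, with at least one example invocation
    ## License         - applicable license(s)
    ## Keywords        - search keywords
    ## Contact         - who to ask questions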

Some scientific disciplines also have unique aspects that should be considered. For example, ML-based efforts produce training data, evaluation code, models, and so on. Papers with Code has a set of community recommendations for packaging ML code.

Experiment setups can be packaged with research algorithm software or separately, if desired. If an experiment setup could be useful to other researchers in the future, separate packaging may make sense (e.g., an experiment around network-based botnet command-and-control detection). State any experiment prerequisites and assumptions explicitly. Include the software and configuration files you used for setup, collection, reformatting, and analysis; describe each piece of software and where it fits in the experiment pipeline; and include anything else someone would need to reproduce your experiment. Also consider sharing results tables or graphs along with the procedures you used to generate them. Providing this level of information helps other researchers reuse your experiment setup more quickly.
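
As a hypothetical sketch (directory names are placeholders, not a required layout), a self-contained experiment package might look like:

    experiment-package/
      README.md        - prerequisites, assumptions, and step-by-step reproduction instructions
      setup/           - scripts and configuration used to stand up the experiment environment
      collection/      - data collection scripts and their configuration files
      analysis/        - reformatting and analysis code, noting where each fits in the pipeline
      results/         - generated tables and graphs, with the commands used to produce them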

Finally, consider providing a deployable version of your software artifact as a Docker image, virtual machine image, or similar container image. This ensures the software is installed and configured exactly as it was used in your research and facilitates rapid reuse.

Dataset packages should include the actual data, a description of the data (including data fields), how the data was collected or generated, any licensing restrictions, search keywords, and a contact for questions. It may also be helpful to include citations for works that are based on the data. Datasets stored in git repositories should document the aforementioned items in a README.md file, preferably using the structure in one of our provided templates [INSERT LINK HERE]. Our automated artifact importer tool reads and parses any README.md file, so the more closely it matches one of our templates, the better job the importer will do.
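
When describing data fields, a simple data dictionary often suffices. As a hypothetical example for a network flow dataset (field names and types are illustrative only):

    Field        Type       Description
    timestamp    ISO 8601   Time the flow started (UTC)
    src_ip       string     Source IP address (anonymized)
    dst_port     integer    Destination port
    bytes        integer    Total bytes transferred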

Providing artifacts in common or well-known formats will aid reuse. If a proprietary format is necessary, including a pointer to a tool that helps access the data (e.g., a Python data parser) will facilitate quicker reuse.
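
For instance, if your data uses a custom line-oriented format, even a short loader script lowers the barrier to reuse. The sketch below assumes a hypothetical tab-separated record format (timestamp, source, value); it is illustrative only and not tied to any particular SEARCCH artifact:

    # parse_records.py - minimal loader for a hypothetical tab-separated record format.
    # Each non-comment line is: <timestamp>\t<source>\t<value>
    import csv
    import sys
    from pathlib import Path

    def load_records(path):
        """Yield one dict per record, skipping blank lines and '#' comment lines."""
        with Path(path).open(newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row or row[0].startswith("#"):
                    continue
                timestamp, source, value = row
                yield {"timestamp": timestamp, "source": source, "value": float(value)}

    if __name__ == "__main__":
        for record in load_records(sys.argv[1]):
            print(record)

Shipping a small parser like this alongside the data, and mentioning it in the README, saves every downstream user from rediscovering the format on their own.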

Publishing on the Internet

Software and datasets may be published anywhere on the Internet, including on personal web sites, but please consider long-term storage when deciding where to publish. If you use a personal web site and that site goes away, the artifact will no longer be accessible to the community. Where possible, we recommend a professional service that is committed to long-term preservation.

When artifacts are published on github.com, Zenodo, or other Internet-accessible git repositories, the SEARCCH import assistant can help with the import process. Some possible publishing locations include:

  1. Zenodo - versioning, 50GB, free bandwidth, DOI, and long-term preservation
  2. GitHub - versioning, 2GB file limit, and free bandwidth
  3. OneDrive - versioning, 2GB (free)/ 1TB (with Office 365), free bandwidth
  4. Google Drive - versioning, 15GB, free bandwidth
  5. Dropbox - versioning, 2GB (paid unlimited), free bandwidth
  6. AWS S3 - versioning, paid only, paid bandwidth
  7. DAGsHub - a way to track experiments, version data, models & pipelines, using Git

Importing to SEARCCH

When importing software or dataset artifacts to SEARCCH, we recommend using the software/dataset repository or website URL as the primary artifact URL.

You can add File URL metadata links that point to specific parts of the artifact. For example, you can link directly to the source file folder and to the main documentation (e.g., the README.md of a git repository). It is not necessary to create a File URL for every individual source file, but if desired you can create a File URL entry for each subcomponent. For datasets, you can add File URL links that point directly to the data file(s) and to the dataset documentation.
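
As a hypothetical example (the repository name and paths below are placeholders), a software artifact hosted on GitHub might be cataloged as:

    Primary artifact URL:      https://github.com/your-org/your-tool
    File URL (documentation):  https://github.com/your-org/your-tool/blob/main/README.md
    File URL (source):         https://github.com/your-org/your-tool/tree/main/src
    File URL (data):           https://github.com/your-org/your-tool/tree/main/data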