Interview with Florina Piroi, Dipl.-Ing. Dr.
Researcher in the Research Group of Information and Software Engineering at the Institute of Information Systems Engineering
The effort to publish support data for a publication is underestimated
What is your field of research?
Domain specific information retrieval, Data Science, Text Processing.
Can you give us examples of how you use data management in your everyday work?
When data is involved, the larger data sets are located on a server at our institute where the code runs. That is: data should be close to where the code is running. Processed data is left on that server, usually with minimal documentation: a text “readme” file that lists the author and, when we are lucky, a description of the folder contents, the types of files in the folder, and the source. Benchmark data, which in Information Retrieval are called test collections and are used to evaluate machine learning and/or IR algorithms, we try to make public through entries in data citation repositories such as GitHub or Zenodo.
An example of a test collection – currently available on an IFS server – is the CLEF-IP test collection of patent documents. Another collection of patent documents created by us and available to the Information Retrieval community is located on the TREC servers at NIST. On Zenodo, I manage another collection of patent data available to researchers, the WPI Patent Test Collection, though I was not involved in creating this data set.
Are you using data repositories for data publication?
Yes and no. I should :) The research I am involved with starts from existing data sets and benchmarks, mostly available somewhere on the internet or in other repositories. Data we often use for exercises with students, MSc subjects, etc. is on a server at our institute and accessible to our research group. The transformations applied to this data during a research project are not stored outside the server where the experiments ran. I don't think that should be done in all cases. Much of it can be seen as intermediate results or data sets.
Leaving the data there and not publishing it on some repository has different reasons:
- no procedures established or enforced in our group
- it takes quite some effort to prepare data that is apt for publication
- the paper writing-reviewing-publication process is such that by the time the paper where some data was used/created is accepted, the interest in preparing the data for publication is gone (this could be alleviated by addressing the second point beforehand).
For convenience, and if the data is not too big, I place it on our institute webserver at a location which I know will last longer than my job at the TU. As stated earlier, I am in charge of two such "repositories": the CLEF-IP patent data (on our institute's website) and the WPI patent collection, which is available on Zenodo.
You mentioned the readme-files that are stored with your processed data. Which information is particularly relevant for the description of your data?
Sometimes it would be nice to have support data that is auxiliary to a scientific publication, but the effort to publish it is underestimated. Such data should be accompanied by documentation that describes it in terms of file types and formats, encoding, and content of the files – is it structured? If yes, how? etc. Statistics about the data are also extremely useful if re-use is the aim. For example, if a collection of patent documents is published, it is important for a researcher in text retrieval to know the distribution of the document languages – how many documents there are in English, in French, in German – or which technological areas are covered, etc.
Such data documentation reports can span several pages, and summaries are sometimes included in the scientific publications that use the data. Researchers must be reminded that this type of documentation should be created not after the research is finished, but during the data processing and while the research is carried out. Because time is so precious, this activity is, sadly, skipped, though it may turn out to be a nagging issue when other researchers ask for access to and details about data mentioned in publications.
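As an editorial illustration of the statistics mentioned in the answer above, the following is a minimal Python sketch of how a language distribution could be computed and written into a readme file. It is not the procedure used for any of the collections named in the interview. The function names `language_distribution` and `readme_section` are hypothetical, and the sample language codes are invented placeholder values, not real figures.

```python
from collections import Counter


def language_distribution(languages):
    """Return {language: (count, share)} ordered by descending count.

    `languages` is assumed to be an iterable with one language code
    per document, e.g. taken from the documents' metadata.
    """
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: (n, n / total) for lang, n in counts.most_common()}


def readme_section(languages):
    """Format the distribution as plain text for a readme file."""
    lines = ["Document languages:"]
    for lang, (n, share) in language_distribution(languages).items():
        lines.append(f"  {lang}: {n} documents ({share:.1%})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample values for illustration only.
    sample = ["en", "en", "de", "fr"]
    print(readme_section(sample))
```

Producing such a summary while the data is being processed, rather than afterwards, keeps the documentation in step with the data it describes.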