Recommendations
The recommendations of the “Rise of Data” project given below are meant to provide a balanced view of the opinions in the materials community on the role of data in materials research. The recommendations were compiled from working group discussion at the UMD Workshop, e-mails sent to , and the online discussion forum. These recommendations will continue to evolve as new input is received until the Position Paper is formally published.
DRAFT — 25-Jan-2016
Theme 1 — Materials Cyberinfrastructures
- The group assembled at the UMD workshop was overly focused on atomic-level materials research. It is important to engage the entire materials research community in the discussion. For example, the continuum materials research sub-community is well organized and can be looked to as an example of best practices.
- Better social networking communication is needed within the community. iMechanica is a good example. However, the materials community may be too diverse for a single effort of this kind to be highly successful, so social networking systems for individual materials sub-disciplines should be encouraged.
- The materials research community has many data sets that are of interest only to small subsets of the community. This complicates the task of producing a cohesive community-wide data strategy.
- Professional recognition is extremely important and a mechanism for recording and tracking credit for contributions is required. An example where this is currently being done in other fields is the Stack Overflow platform.
- Whatever cyberinfrastructures are developed, they need to be flexible enough to adapt to technology advances over the next 10 or so years. This will require sustained and intensive development efforts that must have a reliable funding source.
- Combining experimental, computational, and theoretical data in each resource or database should be encouraged, with search tools to help users find data of interest.
- Data needs to be as discoverable as publications are through Google Scholar. Significant effort should be encouraged to develop and deploy robust search and query schemes that are capable of accessing a federated system of databases and returning high-interest results.
- Databases are not always necessary; sometimes smaller data sets are better served by simple webpages. Form should follow the content. The community should be encouraged to share their data in the most useful and expedient form and then to register the existence of their data sets with more centralized resource managers to ensure its discoverability.
- Translation tools to transfer data between different systems are important and the development and sharing of such tools is a high value activity that should be encouraged.
- Workflow tools are good for forcing users to organize their data in a specified way and to archive their data.
- Data generated with high-throughput schemes often need to be reviewed for quality. Such a review requires significant domain knowledge and can currently be performed only by a human expert. This is not scalable, so we need to develop new "automated data quality testing" techniques. These might be based on machine learning methods. For smaller data sets, tools that allow for quality endorsement by domain experts might be useful.
- Journals for "Data Publications" already exist, but are limited. More and better ways of uniquely labeling, discovering, and distributing data sets are needed.
- It is very difficult to get members of the community to contribute to existing database projects. In many cases such projects end up containing only data from the group(s) that created the system. One idea is to have Federal funding agencies provide "glue" funding to encourage researchers to contribute their data sets to existing database projects. However, this could have the effect of "picking the winners" by having funding agencies target certain projects for support.
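The "automated data quality testing" idea raised above can be illustrated with a very simple screening rule. The following is a minimal sketch, not a method proposed by the workshop: it flags suspect values using a modified z-score based on the median absolute deviation, standing in for the machine learning methods mentioned in the text. The function name `flag_outliers`, the threshold of 3.5, and the sample values are all assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, d in enumerate(abs_dev) if d > 0]
    # 0.6745 scales the MAD so that the score is comparable to a standard
    # z-score for normally distributed data.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical values from a high-throughput run (illustrative only).
lattice_params = [4.05, 4.04, 4.06, 4.05, 5.90, 4.03, 4.05]
suspect = flag_outliers(lattice_params)
print(suspect)
```

A screen of this kind only marks entries for attention; deciding whether a flagged value is an error or a genuine discovery would still fall to a domain expert.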
Theme 2 — Enabling Infrastructure Creation
- A monolithic top-down infrastructure is not likely to be effective for the materials research community due to the highly heterogeneous nature of its data sets and the diversity of how different segments of the community use the data.
- The most important issues that were identified involve (i) how to create incentives for researchers to contribute to data-sharing projects; (ii) how to overcome technical barriers to the creation of cyberinfrastructures; (iii) how to record, track, and reward development contributions to such infrastructures; and (iv) how to provide, to new materials researchers, the training necessary to use and/or create such cyberinfrastructures.
- There is a need for simple tools that everyone can use on a day-to-day basis for collecting, organizing, analyzing, and archiving data sets.
- A cyberinfrastructure must have obvious advantages for a researcher to want to use and contribute to the resource.
- Data provenance and context metadata are critical for data discoverability, quality control, and for enabling appropriate use of the available data.
- Credit attribution must be visible, assured, and systematized in order to incentivize researchers to invest the necessary effort to contribute to such systems.
- There are major technical barriers to incorporating data collection and management into a researcher's day-to-day activities. Productivity-enhancing data management software tools, in analogy with version control software such as Git, are needed to help the researcher "live in the database" during their day-to-day efforts.
- Pilot projects for well-defined and mature subsections of the materials research community would be good incubators for exploring various approaches to the creation of cyberinfrastructures.
- Education on data skills will need to continue to grow and be refined so that a critical mass of materials researchers can recognize the potential of such infrastructures and realize that potential.
- A call for topical data sets in grand challenge problems could encourage people to participate because of the attention paid to contributors of crucial data to high-visibility efforts.
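The points above on provenance, credit attribution, and Git-like tooling can be made concrete with a small sketch. It assumes, purely for illustration, that each data set is identified by the SHA-256 hash of its contents (loosely analogous to how Git identifies objects) and that each entry records its contributor and the entry it was derived from. The names `make_entry` and `lineage` and the sample contents are hypothetical.

```python
import hashlib
from typing import Optional


def make_entry(data: bytes, contributor: str, note: str,
               parent: Optional[str] = None) -> dict:
    """Build a provenance entry identified by the SHA-256 hash of the data."""
    return {
        "id": hashlib.sha256(data).hexdigest(),
        "parent": parent,            # id of the entry this one was derived from
        "contributor": contributor,  # who should receive credit
        "note": note,
    }


def lineage(log: dict, entry_id: Optional[str]) -> list:
    """Follow parent links back to the original entry, collecting contributors."""
    chain = []
    while entry_id is not None:
        entry = log[entry_id]
        chain.append(entry["contributor"])
        entry_id = entry["parent"]
    return chain


# Hypothetical contents and contributors, for illustration only.
raw = make_entry(b"strain,stress\n0.01,70\n", "Lab A", "raw tensile test")
clean = make_entry(b"strain,stress\n0.010,70.0\n", "Lab B",
                   "unit-checked copy", parent=raw["id"])
log = {entry["id"]: entry for entry in (raw, clean)}
credit = lineage(log, clean["id"])
print(credit)
```

Because the identifier is derived from the content, any change to the data produces a new entry, so the chain of contributors remains traceable without relying on file names.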
Theme 3 — Data Management and Handling
- Data in Materials Research is characterized by large dimensionality as opposed to the more common large quantity. This should be treated as a unique aspect of Materials Data and advertised as an opportunity for machine learning, artificial intelligence, and computer science researchers.
- The following pressing issues were identified: (i) ease of use for data input and output in data management and handling; (ii) minimizing the costs of data management and handling; (iii) handling metadata, as well as data, as part of a researcher's workflow; (iv) attribution and recognition of effort as incentives; and (v) developing collaborations with researchers in data-centric fields.
Recommendations for community actions:
- Define a common minimal set of metadata to accompany each data set. Investigate Dublin Core as a possible basis and consider extensions for materials data sets.
- Encourage the addition of uncertainty values in data or metadata sets.
- Advocate for the use of persistent identifiers.
- Accept and use data citation in bibliographies for dataset attribution.
- Organize interdisciplinary symposia with data-centric research fields.
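As an illustration of the first three community actions, a metadata record could be checked automatically before a data set is shared. The sketch below uses the 15 element names of the Dublin Core Metadata Element Set; the choice of which elements are required, the `measurements` extension carrying uncertainty values, the function name `validate_record`, and all sample values (including the placeholder identifier) are assumptions made for illustration, not a proposed standard.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Assumption: a hypothetical "minimal" subset, chosen for illustration only.
REQUIRED_ELEMENTS = {"title", "creator", "date", "identifier", "description"}


def validate_record(record):
    """Return a list of problems found in a metadata record (empty if none)."""
    problems = []
    core = record.get("dublin_core", {})
    for name in sorted(REQUIRED_ELEMENTS - set(core)):
        problems.append(f"missing required element: {name}")
    for name in sorted(set(core) - DUBLIN_CORE_ELEMENTS):
        problems.append(f"not a Dublin Core element: {name}")
    # Hypothetical materials extension: every measurement carries an uncertainty.
    for m in record.get("measurements", []):
        if "uncertainty" not in m:
            problems.append(f"no uncertainty for: {m.get('quantity', '?')}")
    return problems


# Illustrative record; the values and the identifier are placeholders.
record = {
    "dublin_core": {
        "title": "Elastic constants of fcc aluminum",
        "creator": "A. Researcher",
        "date": "2016-01-25",
        "identifier": "doi:10.0000/example",  # placeholder persistent identifier
        "description": "Illustrative data set for a metadata check.",
    },
    "measurements": [
        {"quantity": "C11", "value": 107.0, "unit": "GPa", "uncertainty": 1.5},
        {"quantity": "C12", "value": 61.0, "unit": "GPa"},
    ],
}
problems = validate_record(record)
print(problems)
```

Here the second measurement is reported because it lacks an uncertainty value; a record that passes such a check would also carry the persistent identifier needed for data citation.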
Recommendations for funder actions:
- Fund interdisciplinary symposia with data-centric research fields.
- Consider focused proposal tracks in which only third-party data is used.
- Request and reward data handling practices in proposal applications that line up with the proposed/adopted community actions.
Theme 4 — Knowledge from Data
- Data is not sufficient; knowledge is the key.
- Reproducibility of materials science is a key component of research and it is important to ensure that data is accompanied by as much metadata as necessary to enable reproducibility.
- The development of close collaboration between domain specialists and data specialists in a back-and-forth exchange of ideas, analysis, and results at all stages of a scientific investigation should be highly encouraged.
- Funding agencies should not mandate that everything be stored (publicly or privately). There should be a (federally supported) place where data can be stored.
- The idea of "descriptors" of materials systems as an encoding of "knowledge" may be a useful way of thinking.
- Thought needs to be given to what data is useful to keep and make available. This is especially true in computational science, where large amounts of data are generated while codes and models are being developed, and it is not clear what value such data could have to the general research community.
- It might be better to support one group that can generate data very carefully than to have many independent data sets.
- Negative results are useful, but there are challenges and resistance to making them available. The community could change this by encouraging researchers to make these results and data available.
Theme 5 — Education of Materials Researchers
- Much of the community does not currently understand the value of data science, so an effort is needed to convince researchers and faculty that data science tools are useful. The creation of exemplars to illustrate the potential of data science tools could help convince the broad community of the importance of these tools.
- Teaching students critical thinking skills and the core concepts of materials science is crucial. The inclusion of data-centric educational material in the classroom should not detract from these goals; rather it should provide opportunities to enhance student acquisition of these skills and knowledge.
- Experience with the development of computational materials science curriculum suggests that we should not "force" educational bodies to include a data science course, but instead encourage the development of elective course offerings or the integration of data-centric modules/exercises in existing courses.
- To take advantage of data in materials science, we must train our researchers to be able to speak to data scientists. This allows for cross-pollination of ideas between these fields.
- Work is needed to develop guidelines that help students put their data into an appropriate form for sharing, including all important metadata and the "context" of their data.
- Libraries have a great opportunity to redefine their role in the new data-centric research age. Libraries are great places to store data for long term archiving, but may not be able to provide necessary high-volume high-speed access or other domain specific indexing and/or analysis capabilities.
- Librarians can help organize/index data and can help find and access data outside of the local institution.
Theme 6 — Grassroots Standards and Government Support
- Funding agencies and community leadership constitute the two primary top-level entities that can have a significant impact on the direction taken by the materials research community in regard to the adoption of data science methods and tools.
- The task of establishing a federated system of materials databases is too large to relegate to volunteer efforts and requires a sustained funding initiative across the US federal research funding agencies.
- These databases should be populated through a community effort, but researchers may need supplementary funding to encourage and support participation in this activity.
- The community leadership needs to develop standards for archiving, distributing, and crediting materials data sets. For each identifiable class of materials data, a standard should be defined for the minimal set of data and metadata required for independent investigators to verify the accuracy/reliability of a data set.
- Community leaders should compile success stories to convince the materials research community that data sharing is important and will lead to accelerated discovery and innovation within the community. [As a first step, Success Stories are being collected here.]
- The creation of a "Materials Institute" may be ineffective because of the extreme diversity of the materials research community: a single institute would not be able to sufficiently represent all aspects of materials data.