Funding agencies are increasingly requiring grant proposals to include data management plans, through which you describe and preserve data, and make it easier for others to access the information you have gathered, verify your results, compare studies, and incorporate findings into future work. Research development specialist Cheryl Dykstra-Aiello discusses the considerations you need to make as you plan your data management strategy to facilitate the exchange of knowledge and meet funding agency expectations.
My name is Cheryl Dykstra-Aiello. I’m a research development specialist in the Office of Research Advancement and Partnerships. And today I’m going to be talking about data management plans. So thank you for joining me.
So why does data management matter? It matters because we want our research to be reproducible. We want to verify the results. We may want to reuse and compare our data in follow-up projects. And probably most importantly, we want to be able to secure funding, and many of the funding agencies now require data management plans.
Data management also helps with tracking your research impact. It can make your research visible and findable. And it is actually a very efficient core organizational tool to reduce administrative burden, particularly if you define those expectations at the beginning of your research project. Really, what I want to say is that managing your data and having a plan in place is best practice for any research project.
It doesn’t matter whether a plan is required for funding or not. It’s just an excellent thing to be doing. So continuing on with the question about why it matters, we’ve had a few high-profile incidents at WSU, and these are just a few of the headlines about data breaches and retractions.
And so if you’re managing your data and depositing it in a safe repository, then hopefully you don’t have to worry too much about data breaches. It can also be used to show that you haven’t faked your data, so you don’t have to retract any articles. Historically, management of data has been done by individual departments in the various fields of study because of the varying types of data that are collected across disciplines.
But now there’s increased interest in having best practices in place, and that’s across all disciplines and all institutions, so data management is becoming more standardized and more centralized. So as I mentioned, one of the important reasons that you would want to manage your data and have a plan in place is to secure funding.
And this is just one example. NIH requires that data be shared and that any PI applying for funding from NIH have a plan for managing their data. And I’ve included a link here to their data management and sharing policy that you can check out. These slides will be sent out along with a link to the recording of today’s session.
So anything on my slides that is underlined and in red font links to a website. NSF also has a data management plan requirement and a policy for data sharing, and I encourage you to look at those. I will also mention that our website, the ORAP website, has a research development toolbox.
And in that toolbox you can find templates for data management plans for the various agencies. If your call for funding mentions a data management plan and you can’t find a template on our site, feel free to reach out and we can develop a template for you. So some of the concerns about data management that you need to consider are the types of data, how much data you’re going to be collecting, and storage of that data. And how you’re going to make it accessible and ensure that accessibility over time.
So not just now, but for a period of time. You want to know and report who will be responsible for making the decisions about data management. And how are you going to describe the data? So the descriptive data is really the metadata and you have to have practices in place for that. There was a survey that was done by OSU in 2013.
They had 20.6%, or 443, of their faculty respond to this survey. They asked about the types, volume, and storage of the data and found that most faculty who responded had their data in spreadsheets or delimited data and XML, along with some images and some text. They had data that was mostly less than 100 GB, and the storage was mostly less than five years on personal computers, external storage devices, or servers within the research group.
That storage varied across disciplines. As you can see, I’ve noted here that data on college and department servers was stored mainly by faculty in agriculture, business, forestry, vet med, and pharmacy, while faculty in education used cloud storage for their data. Campus-wide data storage was really not used very much.
As far as the tasks and the roles and who was responsible for the data, the data management itself was left up to the lab techs and research assistants, while sharing and backup were provided by the PIs. And for metadata, the practices were either not really standardized or they were using locally created standards.
So, in another survey of psychology researchers that was done in 2020, they reported that participants really did have good data management practices, but that there was little standardization in the research group. And this is important for going forward and being able to duplicate any of the results that you get with your data.
So what are the elements of a data management plan? I’ve mentioned most of them already, but you need to include the type of data that you’re going to be collecting and the sources, where it’s going to be collected from; any file formats and how you’re going to manage the data; your descriptive metadata; and the roles and responsibilities associated with that project for data collection and storage. You need to report on the privacy and security of the data, and on the preservation and storage of the data, so where it will be preserved and stored.
You also need to think about the ownership of the data and intellectual property, because that will become important, and I will have a slide about that later on in this presentation. And also public access and data sharing and reuse: how, when, where? The important thing is that although these are elements of a data management plan, you need to read the call, read the opportunity.
Because they may have other elements that they want reported in the data management plan. So don’t just go by this list. It’s kind of a general list. Read the call.
As we talk about data types and storage, you want to report in your data management plan the types of data that you’re going to generate. So will it be experimental data? Is it going to be raw or processed? Qualitative data? Is the result of your activity going to be physical collections? Is it going to be digital data?
And also consider if there’s specific software that’s involved in collecting the data. How much data are you going to be generating over the life of the project? What is the file size that you’re going to be accumulating, and which data will you be sharing?
Because you may not be sharing all of the data that you collect. When will you be sharing it? The best practice for storing data is to store raw data. Don’t store the data that you’ve made calculations on from that raw data. If you’re using data from other sources, then report what the sources are and report the content of that data. And if there are any conditions associated with obtaining that data and using it, report that as well.
And if you’ve got different data sets, then you need to report the relationship between those sets. Define your variables. You’re going to want to be able to go back yourself, even, to look at your data, and you want to know what exactly is being measured. Report the units of measurement. And if you’ve got standard operating procedures, and hopefully you do, you want to make sure that those are reported in there as well.
Do you have some confounding factors that you’re going to have in your data, for example, DNA? And does that mean that you’re going to need increased security? These are just some things that you need to consider. You don’t really want too much unstructured data. And I would suggest using software that enforces data types and ranges.
So don’t use font color or highlighting as part of your data. Also, when you’re using Excel, be very aware that gene or protein names can be converted to dates. This was a peeve of mine when I was an assistant research professor on the Spokane campus and reporting gene names. It just irked me that they got converted to dates so frequently.
And if you want a little bit more information on that I’ve provided a link to an article that you can read called Gene name errors are widespread in the scientific literature. And that is one of the reasons for that. So you can go and reference that if you want to. If you’re coding for your discrete variables make sure you’ve got a codebook and also code for any missing data but don’t use codes for continuous variables.
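The Excel date conversion mentioned above can be screened for after the fact. What follows is only a minimal sketch, assuming the mangled values show up in the form a spreadsheet typically displays them (for example "1-Mar" or "2-Sep"); the function name and the sample column are hypothetical and are not part of the presentation.

```python
import re

# Month abbreviations as a spreadsheet typically displays a date such as "1-Mar".
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-({_MONTHS})$", re.IGNORECASE)


def find_date_like_symbols(symbols):
    """Return the entries that look like gene names converted to dates."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


# Hypothetical gene-symbol column after a round trip through a spreadsheet.
column = ["TP53", "1-Mar", "BRCA1", "2-Sep", "DEC1"]
suspects = find_date_like_symbols(column)
print(suspects)
```

A check like this only flags suspicious entries. It cannot recover the original gene name, which is one more reason to keep the raw data in a plain text format.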
So, I just want to give an example of some untidy data versus tidy data, using the suggestions that I presented in the previous slides. There’s an untidy data table at the top of this left-hand information box. You can see it’s not really standardized, for one thing, and it would be better going forward if you could tidy that up.
So make sure that your measurement units, for weight and length, are in the column title, and make sure that they’re all consistent. In the untidy data you can see that some are listed in pounds, some are in kilograms. Be consistent. In the tidy data, the column title is weight underscore kilogram.
So you know that those numbers are all in kilograms. The same thing with the length, so in centimeters there. Make sure that your dates are structured. They’re definitely not structured in the untidy data, where they’re listed in different ways, and they’re tidied up in the table underneath. So they’re all listed as year, month, date rather than spelling out the month
and using different ways of presenting dates. It doesn’t matter how you present them, just be consistent with the way you’re doing it. You can also see that in the tidy data the species are using a taxonomic serial number, and then there is a codebook underneath. So there are two tables underneath that tidy data table.
And those are codebooks. They list what the species code represents in the actual table. The same thing was done for station codes. So you see the station code numbers in the main table, and then underneath is a codebook that tells you that station code one is a swamp and gives the latitude and longitude there rather than in the main table.
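The tidying steps described above (units in the column name, one consistent unit, structured dates, and codes explained in a separate codebook) can be sketched in a few lines. This is an illustration only: the sample rows, the "meadow" station, the accepted date formats, and all function names are assumptions made for the example, not the contents of the slide.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # exact definition of the international pound
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")  # assumed input styles

# Hypothetical codebook kept apart from the main table: code -> description.
STATION_CODEBOOK = {1: "swamp", 2: "meadow"}
STATION_CODES = {name: code for code, name in STATION_CODEBOOK.items()}


def to_kilograms(text):
    """Convert a string such as '12 lb' or '5.4 kg' to kilograms."""
    value, unit = text.split()
    unit = unit.lower()
    if unit in ("kg", "kilograms"):
        return round(float(value), 2)
    if unit in ("lb", "lbs", "pounds"):
        return round(float(value) * LB_TO_KG, 2)
    raise ValueError(f"unknown weight unit: {unit!r}")


def to_iso_date(text):
    """Rewrite a date in any of the assumed formats as year-month-day."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def tidy_row(row):
    """Return a row with a coded station, ISO date, and weight in kilograms."""
    return {
        "date": to_iso_date(row["date"]),
        "station_code": STATION_CODES[row["station"].lower()],
        "weight_kg": to_kilograms(row["weight"]),
    }


# Hypothetical untidy rows: mixed units, mixed date styles, spelled-out stations.
untidy = [
    {"date": "March 3, 2021", "station": "Swamp", "weight": "12 lb"},
    {"date": "03/04/2021", "station": "swamp", "weight": "5.4 kg"},
]
tidy = [tidy_row(r) for r in untidy]
for r in tidy:
    print(r)
```

The design choice here is to fail loudly on an unknown unit or date style rather than guess, so inconsistencies are caught when the data is entered instead of during analysis.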
So, recommendations for your names: use 6 to 8 characters and include the measurement in the names. So just like they did with the weight and length, with the underscore kilogram and underscore centimeter, try to do that. There’s also another reference for you to look at if you want to, “Ten simple rules for digital data storage.” I’ve linked that here.
So continuing on with file management and naming, I have two boxes. Follow the green one. Don’t do what’s in the red box. So, report your naming conventions in your data management plan and be consistent. I talked about that in the previous slide. Be consistent with your naming, as you are with the reporting of your data.
If you’re not using a version control software package, then make sure that you’re using version numbers as part of your naming. So V01, or v underscore one, or v one underscore zero, going on to two, three, whatever. Be consistent with your dates, your project names, any initials that you’re using, and your file names. Don’t change them up between files.
And to make sorting easier, you can also start your naming with leading zeros, so 001, 002. Do not rely on file directory structure to distinguish your files. And don’t use any spaces, special characters, or periods in the file names. Do not use ambiguous labels: don’t end your file name with final or revision.
Because what revision is it? And I can tell you from my own experience writing, we had different final versions. So using those is really very ambiguous when you’re going back and looking at your files. On organizing your files, I have a list of various packages that you could use to organize your files.
I’m not going to go through them all. You can read them, but I will say: make sure you plan ahead. So don’t collect your data and then try to figure out what you’re going to be doing and how you’re going to manage it. Start right from the very beginning and put into place the practices that you’ll be using.
Don’t try to decide them later on, both about how you’re going to describe your data and the documents that you’re going to be keeping. Please note that the systems that I’ve listed on the left-hand side are not directly comparable. How you choose the system that you’ll use for file management really depends on the required compliance, the needs that you have in your discipline, and whether or not you’re going to be sharing the data openly.
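The file naming recommendations above (consistent dates and project names, version numbers, leading zeros, no spaces or special characters, no "final" or "revision") can be written down as a small helper. This is a sketch under assumptions: the date_project_description_sequence_version pattern, the function names, and the example project are all hypothetical, and the version check only matters when no version control software is in use.

```python
import re
from datetime import date

# Letters, digits, underscores, and hyphens only, plus a single extension dot.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")
_AMBIGUOUS = ("final", "revision")


def build_name(project, description, collected, sequence, version, ext):
    """Assemble a name using one assumed convention, with leading zeros."""
    stem = (
        f"{collected:%Y%m%d}_{project}_{description}"
        f"_{sequence:03d}_v{version:02d}"
    )
    return f"{stem}.{ext}"


def check_name(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if not _SAFE_NAME.match(name):
        problems.append("contains spaces, special characters, or extra periods")
    stem = name.rsplit(".", 1)[0].lower()
    if stem.endswith(_AMBIGUOUS):
        problems.append("ends with an ambiguous label")
    # Only relevant when version control software is not being used.
    if not re.search(r"_v\d+", stem):
        problems.append("has no version number")
    return problems


# Hypothetical project and measurement names, for illustration only.
good = build_name("soilsurvey", "ph", date(2024, 5, 7), 1, 1, "csv")
print(good)
print(check_name(good))
print(check_name("soil survey final.csv"))
```

Whatever pattern is chosen, the point from the presentation still applies: report the convention in the data management plan and apply it the same way to every file.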
Considering your file formats, you really need to think about long-term preservation. This is really important so that the data can be used over time, and not just now. So don’t use any proprietary file formats. For instance, don’t use PSD files, use TIFF. Also don’t use JPEG, use TIFF, because JPEGs are compressed formats. Use uncompressed, unencrypted, and uncompiled file formats when you are compiling your data and saving it.
Cornell Library has a table that gives file formats that are good for the long term and not good for the long term. So they have high probability, medium probability, and low probability, and the file format types that you should be considering. This is just one row of that table.
And I’ve chosen text. So they really recommend that you use plain text if you’re going to be saving text documents. And I encourage you to check out that table linked here on this slide.
Talking about descriptive metadata, there was another survey that was published in 2016. 1,500 researchers were surveyed, not really about their metadata, but about whether or not they would be able to reproduce research. And that really comes down to the metadata, the descriptives of the data. So your title, author, the scope, and the date that it was created, all of those kinds of things.
That also makes it important not just for reproducibility, but for finding the data sets online. So of those 1,500 researchers that were surveyed, 70% of them said that they had difficulty reproducing other researchers’ studies, and 50% of them said that they had difficulty with their own studies. And that boiled down to the information that was provided with the data sets.
So you really want to document your data so that other people and yourself can follow the project details. I’ve listed some of the key metadata that can be included: units of measure, abbreviations and codes that you used, any instrument that you used to collect that data, and any protocols that were associated with that instrument or with the collection itself.
Any version information for the instrument and the software packages that you’re using, and the dates that you collected your data, are all important information to include. Wherever you can, try to use a formal discipline-specific standard for metadata, and when you’re writing up your data management plan, include the practice for articulating and expressing the information about your data.
If you don’t have discipline-specific standards, then I encourage you to use free-text readme files. So don’t save them as Word documents, which could be hard to find and understand later, but use free text and include a data dictionary and any standard operating procedures along with that. So, some sample standardization.
You can use taxonomies and vocabularies. I’m not going to go through the whole list here; you can read that. For me, when I was doing my own research, I frequently used the Gene Ontology vocabulary. But there are others out there, and you should use those in standardizing your reporting.
There are also schemas available, and lists of metadata schemas can be found on these two websites, the Digital Curation Centre and fairsharing.org. So when you’re thinking about standardizing and schemas, check those two websites out. Going back to readme files, they are an alternative descriptive tool that can be helpful in describing the data that you are collecting and storing.
They should be plain text documents. Make sure they’re nonproprietary. As I mentioned, don’t save them as Word docs; save them as plain text. Name the readme file so that it’s associated with the data set that it accompanies. And follow any conventions for expressing dates, geospatial and geological names, and taxonomic names as you’re naming your readme file.
You can find some examples of readme files on this GitHub.com website. So if you’re not familiar with a readme file, then please do look at that site.
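To make the readme advice concrete, here is a minimal sketch of writing a plain text readme whose name ties it to the data set it accompanies. The "_README.txt" suffix, the field names, and every value below are hypothetical placeholders; a discipline-specific metadata standard, where one exists, should take precedence over this layout.

```python
import tempfile
from pathlib import Path


def write_readme(data_file, fields, folder="."):
    """Write a plain text readme named after the data set it describes."""
    data_path = Path(data_file)
    readme_path = Path(folder) / f"{data_path.stem}_README.txt"
    lines = [f"README for {data_path.name}", ""]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme_path


# Hypothetical metadata; replace with the real project details.
metadata = {
    "title": "Soil pH survey",
    "author": "A. Researcher",
    "date_collected": "2024-05-07",
    "units": "pH (unitless); depth_cm in centimeters",
    "instrument": "handheld pH meter (model and software version go here)",
    "codes": "station_code 1 = swamp; -999 = missing value",
}

# A temporary folder keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_readme("20240507_soilsurvey_ph_001_v01.csv", metadata, folder=tmp)
    readme_name = path.name
    text = path.read_text(encoding="utf-8")

print(readme_name)
print(text)
```

Keeping the readme as UTF-8 plain text means it can be opened with any editor years later, which is the reason given above for avoiding Word documents.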
I mentioned that you need to report your roles and responsibilities associated with data management. So, you need to name in your data management plan who will be primarily responsible for implementing the data management plan. If you are applying to a federal agency and multiple institutions are involved, it’s usually the lead PI who will be tasked by the agency to execute the data management plan.
You also need to consider what happens if key personnel leave the project: how is the responsibility going to be transferred? You need to report who is going to have access to any sensitive data that you’re going to be collecting, and make sure that your standard operating procedures are reviewed with the entire team, so everybody knows who’s responsible for what.
Privacy and security is very important with our data. So you need to know and consider the legal and ethical requirements. That might actually preclude sharing any of your data. Look for agency requirements and publication requirements as well. Publisher requirements. How are you going to manage the data to protect privacy? There are several acts associated with data that you need to be aware of.
And I’ve listed some of them here, such as HIPAA and FERPA. If you’re working with threatened species, take a look at this threatened species article, “Do not publish,” in Science from 2017. For archaeology also, there is a link there that you can check out. So be aware of acts that will dictate the privacy that you need to maintain for your data.
Are you going to need secure storage for your data during the project period? And how are you going to back it up? Who’s going to have access to the working data? How are you going to manage the access before and after the grant? So not just during the period of the project, but before and after.
And if you’re working with collaborators, how are you going to transfer it? How are you going to share it? All of these things need to be addressed in your data management plan. Know what your departmental, institutional, and program policies are on data retention, and know that they may influence your plan; you are going to have to note them and note how the policies are going to be followed. If you’re using a particular software package to manage your files and your data, and I’ve listed some below,
you need to make note of that in your data management plan. You need to consider and report how long and why the data is going to be retained and preserved. You need to know any hardware or campus or commercial services that are going to be used for assurance of data preservation.
If there are costs involved for any of the services, then you can include those in your proposal budget, and you need to mention those in your budget justification. Also consider and report how the samples are going to be stored. Are there going to be biological samples or physical samples that are going to need freezers, and what temperature are those freezers going to be?
Is there specialized storage that you need? These are all things that you need to consider and report in your data management plan. WSU has a requirement for use of its regulated data environment, or RDE. I’ve included the link here, and I’ve just got a quote here: “Going forward, it is required that all WSU researchers and personnel use the RDE in partnership with ITS and your area technology officer, to meet regulatory and security compliance requirements.”
And those requirements come from various acts and regulations put forward by federal and state agencies. So if you have regulated data, make sure that you are consulting with your area technology officer, as well as your research administrator, to determine if you do have regulated data and need to use the RDE. If you don’t know who your area technology officer is, you can reach out to ORSO and they can provide that for you.
And these are just some considerations that you will need to raise with your ATO and your research administrator. So make sure that they understand that there’s a need to accommodate regulated data. Give them enough advance notice of your data requirements so that they can put a plan in place and implement a solution.
You also need to know what information technology and resources are going to be necessary for the storage of this data. And who’s going to need access to it? And then also, how you’re going to transmit the data that you collect to this regulated data environment. So these are all things that you need to talk to your ATO and your research administrator about.
WSU does have some departmental options: hard drives, network drives, cloud storage. Institutionally, we have Outlook, OneDrive. You get one terabyte of storage for that. We also have high performance computing for five-year temporary storage, and you can do that through Kamiak. And there is an upcoming workshop on Kamiak.
So I’ve provided the link here. If you are considering that or want to know more about it, then please attend that workshop. I will point out that it is not through ORAP; it is through Kamiak itself. And then there is the regulated data environment, which I just finished talking about.
We also have several guidelines and policies that you need to be aware of. And I’ve provided the links for you here. Our BPPM and our executive policy manual are good sources and also information technology services.
So I mentioned earlier in the presentation that ownership and intellectual property can become issues. And so it’s very important to lay that out at the very beginning. UC Davis had a fight about strawberry plants that had been developed and were used elsewhere, and UC San Diego also had a clash over Alzheimer’s data.
So it’s very important that you address that ownership and have that talk, particularly with your collaborators, about who’s going to own the data and the intellectual property that is developed through the research that you are performing. The Office of Science and Technology Policy in 2013 put out that it would require federal agencies to develop plans for making any research data generated through their federal funding freely available to the public.
And so because of that, more and more federal agencies are requiring stronger data management plans. So the DOE, NIH and NSF, which I already talked about, and USDA all require data management plans, and that has stemmed from this February 2013 requirement. So talking about public access and data sharing, then, you need to know whether or not your granting agency actually requires data sharing.
There is a website available that you can search if you are not already aware, and I’ve included that link here. So when you are considering public access and sharing your data, if it’s appropriate, then when are you going to share it? Are there going to be copyright protections or commercialization potential that’s going to preclude data sharing at a specific time?
You’re going to need to consider a later time period for sharing that data. Are there going to be conditions for reusing the data? For instance, are licenses going to be needed? And how is the data going to be made available for access when you’re finished with your project?
So will you be using a discipline-specific repository? There is a website that you can go to if you’re looking for that. Or are you considering the WSU institutional repository? We do have an institutional repository that you can save your data to. It is called Research Exchange, and it’s managed by the WSU Libraries.
It allows public access to data sets, and also provides DOIs for submissions. It doesn’t have to be only data sets that you save to Research Exchange. It can also be articles, books, just about anything. You can create a profile on Research Exchange so that people who are searching for you can find you there.
So if you’re not aware of that, then I do encourage you to look at it, because it’s a good way to help with the impact of your research as well. Right now it is supported by Ex Libris Esploro, and it is Esploro that assists in the sharing and safeguarding of your data.
You can contact the libraries for assistance in adding your data sets, but I will make a note that larger data sets, or data with copyright or privacy concerns, might not be a good fit for Research Exchange. But reach out to the libraries because they can help you with that. If you want more information about Research Exchange, I’ve included some links here.
And they will take you, through our libraries, to information about Research Exchange and how it can help you store your data and make it publicly available. For the last few slides of this presentation, I want to talk about a few platforms that you can use in developing your data management plan.
And also finding repositories. So DMPTool is a good one. They actually have some plans that other people have developed and saved there for you to consider. You can also do your own and leave them there for others to find. This is just a list of several of them; you can see there were 1,520 data management plans, as of yesterday,
for you to look at and consider. They just might help you in writing your own data management plan. It’s kind of another tool, other than the templates that we have on our own website, that you could find helpful. So check out DMPTool. There’s another platform. It’s called easyDMP.
And you can log in with your ID or log in with Google to access the information that they have. And so you can create your DMP here. I clicked the button to create a DMP before I got to this screenshot here, and it’s just basically fill in the blank, and you can save and return to it at any time.
And then you can also check out the repositories. They’re discipline specific. There’s a lot. When I was checking yesterday, there were a lot of geosciences and biological sciences repositories there. But, you know, check them out. GitHub has them also for social and behavioral and economic sciences, not just biological, and also computer and information sciences and engineering.
So if you’re not sure where you want to deposit your data, then check out that list of repositories on easyDMP. And then finally, just a list of several more resources that may be helpful to you. There are our internal resources: our Office of Research website has some guidelines and policies associated with data management and sharing.
I’ve already mentioned our research development toolbox and provided a link directly to the agency templates there. Our library guide has information about data management and data management plans. And CEREO has data management workshops if you’re interested in that. And then we have some external links as well to look at for implementing your own data management practices.
So that is the end of my presentation.
Thanks for joining.