Using CIRC’s High Performance Computing Resources

WSU’s Kamiak High Performance Computing (HPC) cluster is a powerful resource available for all WSU researchers to accelerate their research computing. This presentation highlights key features and recent upgrades to Kamiak and lays out the Center for Institutional Research Computing’s (CIRC) plans for the future. Learn how you can take advantage of Kamiak’s freely available research computing resources and find out how you can purchase expanded compute power for your lab using CIRC’s condominium-style investment model.

Alan Love: Welcome to the Office of Research Kamiak Information Session. I'm Alan Love, and I just want to make a quick introduction. Peter Mills is the deputy director who manages Kamiak, and he'll be presenting today. I also want to introduce Rohit Dhariwal, our computational scientist, who manages our helpdesk and education programs.

Alan Love: Roy Obenchain and Will Aoki are our system administrators and keep the place running. And Tim Neumann is our program coordinator who keeps everything on track. I want to mention that we just completed a Kamiak user survey and found a very high level of satisfaction among current investors and users of Kamiak, and I hope that you will find the same.

Alan Love: We have a really outstanding staff and facilities here at Kamiak. So, Peter, you can get started. Thank you.

Peter Mills: All right, thank you for that introduction. I'm Peter Mills, deputy director of the Center for Institutional Research Computing, or CIRC. I'll leap right in: what does CIRC provide to the WSU research community? The mission of CIRC is to provide high performance computing resources and expertise for all of WSU — the acronym you'll see throughout the slides is HPC.

Peter Mills: That involves two capabilities. The first is the Kamiak HPC cluster, the principal research computing facility that CIRC provides. The second is user support, including software installation and training. HPC is important for research at WSU in that it plays a key role in advancing computational and data-intensive research across many domains, research that is beyond the capabilities of most conventional computing facilities. That includes techniques such as those found in genomics and computational biology, simulation and modeling, and AI training and inference.

Peter Mills: These techniques, as well as others that are very computationally demanding, are applied across a wide variety of research areas at WSU. It's very important to note that Kamiak is available for use in research by all faculty, students, and staff at any of WSU's campuses at no cost. I'll describe the specifics of that in detail in a few moments.

Peter Mills: So, a bit of terminology: what is the Kamiak HPC cluster specifically, or what is an HPC cluster? I'll do just one slide of background. An HPC cluster is a set of computers called nodes — a term from graph theory — connected by a very high speed network. The idea is that applications run in parallel over many cores and nodes, which makes large problems tractable in two dimensions. First, the size of the problem can scale up quite significantly. Second, the time required to solve the problem goes down. And these nodes are much more powerful than a conventional laptop or desktop.

Peter Mills: The standard nodes we offer now have 64 cores and 512GB of memory, and they go up to the eighties of cores and two terabytes of memory. So these are very powerful computers. And again, WSU researchers get access to the nodes and storage at no cost: all WSU researchers have free access to idle compute resources.

Peter Mills: There are also free storage allocations for all users: 100GB per user and 500GB for each faculty lab. Now, in order to build Kamiak — and I'll describe this a bit more on the next slide — CIRC follows a condominium model of investment for faculty and colleges. The way we build up our compute nodes is that investors may purchase compute nodes on which they get priority access.

Peter Mills: Specifically, investors can run non-preemptible jobs on their nodes. The idle resources beyond that are scavenged and made available for use by all researchers. In addition, extra storage beyond the free allocations is available for rent from the CIRC service center. So by following the condominium model, we're able to fund the compute nodes.
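From a user's point of view, running on those scavenged idle resources typically means submitting a batch job to the shared backfill queue. The sketch below assumes the scheduler is Slurm and writes a job script from Python; the partition name, resource values, and script name are assumptions to check against the Kamiak user guide.

```python
# Hedged sketch: submit a job to a shared backfill partition via Slurm.
# Names here (partition, script, resources) are assumptions; backfill
# jobs run on idle investor nodes and may be preempted when the node
# owner needs the capacity back.
import subprocess
from pathlib import Path

job = """\
#!/bin/bash
#SBATCH --partition=backfill     # shared idle-resource queue (assumed name)
#SBATCH --cpus-per-task=8        # 8 cores on one node
#SBATCH --mem=32G                # 32 GB of memory
#SBATCH --time=02:00:00          # wall-clock limit
srun python my_analysis.py       # hypothetical analysis script
"""
Path("job.sh").write_text(job)
subprocess.run(["sbatch", "job.sh"], check=True)  # hand the job to the scheduler
```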

Peter Mills: So the compute nodes are funded by faculty and colleges. But the overall structure of CIRC and the HPC facility is a joint venture among the colleges and faculty, the Office of Research, and ITS, Information Technology Services. Specifically, the infrastructure — by which I mean the racks, admin nodes, storage, and network — is funded principally by the Office of Research as well as WSU, through state funding.

Peter Mills: The data center space, power, and maintenance are provided by ITS, and we interface quite closely with the data center operations staff to install and maintain the nodes, so ITS plays a very important role in supporting this venture. Now, in terms of investment: the Kamiak HPC cluster became operational in 2016, and the original hardware investment by WSU was about $1 million.

Peter Mills: That includes significant investments from the colleges; there were about 35 nodes at that time. Since then, faculty investment as of October 2023 is another $1.6 million — that's the condominium model at work — so there are now about 112 invested nodes with upwards of 3,000 cores. In addition, the Office of Research and state funding provided a significant infrastructure refresh in 2021, on the order of $1.3 million.

Peter Mills: That includes some state-of-the-art enhancements: virtualized head nodes, a high speed InfiniBand network, and an all-flash NVMe parallel storage system.

Peter Mills: In addition, in 2023 we were awarded an NSF MRI grant — an equipment grant focused on AI-driven research — for another half a million dollars. That funded equipment beyond what researchers are typically willing to spend from an individual grant: three GPU nodes, each with four Nvidia H100 80GB GPUs.

Peter Mills: Those are graphics processing unit accelerators that play an important role in making AI training and inference tractable, as well as simulation and modeling — they have a wide range of applications. Each of those nodes, with four GPUs, was about $150,000, so they're pretty expensive, and that was a good investment.

Peter Mills: Those nodes are open, basically, to all researchers, with non-preemptible status.
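As a quick illustration of working with those accelerators, the hedged sketch below checks which GPUs a job can actually see before launching any training; it assumes a PyTorch installation is available in the job's environment (for example, through the cluster's software modules).

```python
# Hedged sketch: verify the GPUs allocated to your job before training.
# Assumes PyTorch is available; on one of the MRI nodes described above,
# the loop should list four H100 devices.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")
else:
    print("No GPUs visible; request them in your job submission.")
```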

Peter Mills: So that's one aspect of what CIRC provides: the research infrastructure itself, the Kamiak HPC cluster. In terms of support services, there is first the management of the HPC cluster, which is principally under the auspices of our two HPC system administrators. That means managing the cluster infrastructure — all the admin and login nodes, network, and storage — which is a considerable task, as well as the installation and maintenance of compute nodes.

Peter Mills: Both of those activities, again, are joint with the data center operations staff of ITS. In terms of user support, we provide the full gamut you would expect to enable effective use of the HPC cluster. That includes account creation and group authorization — every user is a member of a faculty lab group — as well as guidance and training in the use of the HPC cluster, the installation and tuning of software applications, and problem resolution.

Peter Mills: For problem resolution specifically, we provide two avenues of support, both online. One is the Atlassian service desk, through which users can submit tickets; our response time is typically on the order of four hours. We also have interactive Zoom helpdesk hours available by appointment, so users can log in through a Zoom screen share, like we're doing now.

Peter Mills: We can interactively help them with their problems or provide additional training. In addition, one of the services CIRC provides is investment support: we research and recommend node configurations, try to keep them current with the state of the art, and coordinate the purchase and installation of nodes bought by faculty as well as colleges.

Peter Mills: In addition to those detailed user support services, CIRC also engages in a number of outreach activities. We offer introductory training sessions twice a year, as well as ad hoc, more advanced training sessions on demand. We also host information sessions such as this one. And we provide a variety of user communications, including a monthly events newsletter.

Peter Mills: We have a mailing list where we post upgrades and announcements for the Kamiak HPC facility. We also have WSU and Office of Research news postings, in addition to entries on a number of websites, including our own, hpc.wsu.edu, as well as links from research.ai and the Office of Research.

Peter Mills: In addition, part of our outreach and our strategic plan is to engage the faculty quite closely. Part of that is surveys to assess user satisfaction, needs, and growth. As Alan mentioned at the beginning, a recent faculty survey indicated the training was very useful and rated the experience with CIRC computing resources as very good.

Peter Mills: I should also point out, as part of the effort to maintain the stability and sustainability of CIRC resources — both our staff and the facilities proper — we have surveys to assess future growth and demand. That helps guide our planning for funding, as well as the type of infrastructure we need to provide to meet the computational demands of the research faculty.

Peter Mills: Lastly, we also have a strong liaison with external HPC resources available to researchers. For example, we're a member of the Rocky Mountain Advanced Computing Consortium, and we have two NSF ACCESS campus champions — I'm one, Rohit is the other. NSF ACCESS is a consortium of NSF-funded universities with major leadership-class facilities.

Peter Mills: Those are exascale-class computing facilities. Within that framework, Kamiak is realistically a mid-range or mid-size computing facility, with about 4,000 CPU cores and about 220,000 GPU cores. So we're considered mid-range. It's a good test bed, but a lot of faculty — I think maybe 40% of them — have computational applications whose problem size is beyond what Kamiak can do.

Peter Mills: They can use Kamiak as a stepping stone and then get cycle grants at some of these larger facilities, for example through NSF ACCESS. We help as much as we can to prepare those resource grant applications to NSF ACCESS. That's part of what we do: supporting research computing by providing Kamiak as a testbed as well as liaison with external, exascale-class HPC resources.

Peter Mills: Now, this slide shows the scope of use across WSU. Kamiak is quite widely used: realistically, we run between 60 and 80% cluster utilization, and a considerable amount of that is off the shared idle-resource backfill queue. We have about 1,400 registered users across 116 departments, spread across almost all colleges and units.

Peter Mills: Principal among them are CAHNRS, VCEA, CAS, and Vetmed, plus some in Vancouver — but members from pretty much all departments avail themselves of Kamiak's resources. And a large number of those users are not investors; we only have between 20 and 22 actual faculty investors.

Peter Mills: Those investors provide the bulk of the compute nodes that are then used by the wider research community. So the condominium model that Kamiak follows is a very important strategy and asset for providing computational facilities to the entire WSU research community.

Peter Mills: Now, there are two things I would take away from this slide. First, the upper left-hand corner gives you an idea of the size of Kamiak. Again, it's a mid-range computing facility: currently 152 compute nodes, split between the colleges and faculty investors, plus a small set of nodes provided by the Office of Research and the MRI grant.

Peter Mills: Total memory across all the compute nodes is about 47TB, and storage — again, on the parallel file system — is about 1.2PB. On the right-hand side, what's notable is not so much the distribution among investor nodes, but that the usage of the backfill queue on Kamiak is quite large.

Peter Mills: That's the dark blue. And note the usage among the colleges as well: the entire right-hand side, with the outer maroon border, covers the colleges together with the backfill queue. So there's quite a large usage across the entire WSU community.

Peter Mills: This next slide shows the growth of Kamiak over time. It doesn't show the growth in the number of nodes — you can get a sense of that from the previous slide on investment — but rather the growth in usage, the number of CPU hours logged since Kamiak's inception in 2016. That's a pretty significant growth rate.

Peter Mills: We didn't post the 2024 data, but it's still climbing. So there's quite significant growth over time in usage by the research computing community.

Peter Mills: A broad variety of major research programs are supported by Kamiak. I don't want to dwell too much on this slide; the takeaway is that the grants funding faculty research that uses Kamiak come from an extremely large array of agencies spanning the research funding continuum.

Peter Mills: What I prefer to focus on is the next slide: the impact on research funding that Kamiak provides. This is interesting. In total, since its inception in 2016, about $130 million of grants have referenced Kamiak in their eREX records — basically, in their funding applications.

Peter Mills: And you can see this growing over time, increasing every year. In fiscal year 2023 it was about $27.9 million; in fiscal 2024 to date, it's about $13.7 million. That's significant for several reasons. Number one, it shows that Kamiak is used in a broad variety of research and has a real impact, potentially, on bringing that funding in.

Peter Mills: The second is that the F&A — the facilities and administrative overhead that comes out of those grants — is partially attributable or allocable to Kamiak. So we're to some degree self-sustaining in our funding, because, as with the Office of Research, we're basically an indirect cost. That's a pretty significant impact.

Peter Mills: One other aspect of the impact on research is the number of publications. In our end user license agreement we ask users to acknowledge the use of Kamiak's HPC resources in their publications. The number of trackable citations is about 148 in papers, plus 14 dissertations. That's pretty significant.

Peter Mills: There are also eight presentations and five preprints, spread across a number of different journals. And that's probably an undercount, because it's likely that not everyone cites us even though we ask them to. But it's pretty significant, particularly if you're looking at NSF grants — they really care about dissertations and the graduation of students. So that's a significant impact.

Peter Mills: And we’re very pleased to see that.

Peter Mills: Now, in terms of plans for growth and sustainability.

Peter Mills: As always, we track our growth rates and our funding sources very closely to make sure we can keep providing this valuable computing resource to the entire community. In terms of our latest tracking, we anticipate about $200,000 per year of faculty investment for each of the next five years, and we expect a large portion of that to be focused on AI-driven research using GPU accelerators.

Peter Mills: That said, considering the application areas, there is also a large amount of simulation, modeling, and genomics, which use more conventional resources. But GPUs do give a significant boost in most of those applications too, for both classical and quantum or de novo simulations. So that's a very important focus area.

Peter Mills: In concert with that, we track the latest technology advancements very carefully, both in terms of the node offerings for faculty and in terms of our infrastructure, which is due for a refresh in two years, in 2026. Even though we're a mid-size facility, I like to think we have one of the best, most technologically advanced clusters around.

Peter Mills: For example, our file system, WekaIO, has placed number one in the IO500 at least once — that means number one in the world. So it's very fast, which makes a difference: if you look carefully at the timing studies we did, a faster file system can have a 3-to-10-times impact on application speedup. It affects overall cluster utilization as well as the time to perform a single task.
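One simple way to see why storage speed matters is to time your own I/O. The sketch below is a hypothetical micro-benchmark, not a Kamiak utility; the path and payload size are illustrative, and on the cluster you would point it at your scratch space on the parallel file system.

```python
# Hypothetical micro-benchmark: measure raw write throughput to a file.
# Fast parallel storage shows up here as more MB/s, which translates
# into shorter I/O phases in real jobs.
import os
import time

path = "io_test.bin"                     # replace with a scratch path
data = os.urandom(256 * 1024 * 1024)     # 256 MB payload

t0 = time.perf_counter()
with open(path, "wb") as f:
    f.write(data)
    f.flush()
    os.fsync(f.fileno())                 # force the data onto storage
elapsed = time.perf_counter() - t0
print(f"write throughput: {len(data) / elapsed / 1e6:.0f} MB/s")
os.remove(path)                          # clean up the test file
```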

Peter Mills: So it's very important. We look to the future very carefully in terms of our growth rate and the infrastructure enhancements we have to make to meet faculty demand, which is growing. Among the anticipated tasks over the next two years: we need to upgrade our Ethernet network to 200 gigabits.

Peter Mills: We already have a very fast InfiniBand network running at 200 gigabits, so we need to complement that. We need to expand our storage; we're looking very carefully at archive tiers in the three-petabyte range. And in terms of operating system and in-house upgrades, we're transitioning to a different, more recent operating system, Rocky Linux, which will bring some application enhancements.

Peter Mills: So we're very much focused on incorporating the latest advances in technology. As mentioned, that includes upgrading our offerings to Dell's next-generation servers, which will happen later this summer. And every five years, pretty much, we have to do a refresh of the admin nodes, the virtualized infrastructure, and the storage servers.

Peter Mills: For those of you who are new or don't have an account on Kamiak yet, this slide is pretty important. This is the last slide, and then we'll open the floor to questions. How do you get started using Kamiak? Long story short, on our website, hpc.wsu.edu, there is a "requesting access" menu item with web links to register for an account.

Peter Mills: There's also a description of how to invest in compute nodes and storage, as well as how to include Kamiak in a grant. We have some boilerplate template text right there that you can put in any grant application if you're going to purchase compute nodes or rent storage.

Peter Mills: In terms of getting help, as I mentioned, we have both Atlassian ticket support and the Zoom helpdesk, and those are on the menu under Support. We also have a lot of user guides and training materials online on our website, including training slides and videos, a fairly extensive user programming guide, and a more succinct Kamiak cheatsheet.

Peter Mills: It’s helpful, I hope, to a lot of people.

Peter Mills: And that's pretty much it. I'll open the floor to questions from anyone at this time. Alan, if you'd like to interject some closing comments, you're free to do so as well.

Alan Love: Peter, thank you. I think that gave a great overview of Kamiak. As I mentioned, the staff and facilities are really outstanding. I hope each one of you is able to take advantage of Kamiak in your work. Thank you, Peter.