


Development of the Scientific Computing Center at Vanderbilt University

Lawrence Fu


When Jason Moore1 came to Vanderbilt University in 1999 as a professor in the Department of Molecular Physiology and Biophysics, he knew that he needed a parallel computer (a computer with more than one central processing unit, used for parallel processing) to conduct his research. His research involved the statistical analysis of genetics, specifically the study of gene-gene interactions and the implications for disease risk. The work he wanted to do would require computational power that could be provided only with high-performance computing (HPC).* The first step he took toward this goal was to apply to the Vanderbilt University Medical Center for a Vanderbilt University discovery grant.2 This program was a mechanism to stimulate the development of new ideas and allow investigators to develop them for future external federal funding. He received $50,000 to build a parallel computer.

Instead of simply starting work on building a system, he decided to find out if any other researchers at Vanderbilt were working on developing a parallel computer. After talking to other researchers from all over the campus, he discovered that Paul Sheldon,3 a professor in the Department of Physics and Astronomy, had done more work than anyone else in this area. Paul’s area of research was elementary particle physics and the study of the physics of heavy quarks. He had worked on the development of a workstation farm called Vanderbilt University physics analysis cluster (VUPAC).4 A workstation farm is a cluster of workstations loosely coupled to provide a very coarse parallel computing environment. Initial support for VUPAC was provided by a National Science Foundation (NSF) academic research infrastructure grant with matching funds from Vanderbilt University. Additional funding by the NSF and the Department of Energy later facilitated upgrades, administration, and maintenance.

Jason and Paul decided to work together and develop a shared resource which they called Vanderbilt multiprocessor integrated research engine (VAMPIRE).5 Paul remembers: “Jason and I quickly realized that we pretty much wanted to do the same things. We had similar goals and similar amounts of money to do it. Basically, it was a meeting of the minds, and we realized [that working together] was the right way to do it. It was an interesting thing to try.” Additional funding for the project was provided by a second Vanderbilt University discovery grant and from the startup funds of another physics investigator. This Vanderbilt University discovery grant came approximately a year after Jason’s initial grant, but this time it came from the university side rather than the medical center. Altogether, the group had secured about $150,000 to accomplish the project.

* This type of computing requires scientific workstations, supercomputer systems, high speed networks, a new generation of large-scale parallel systems, and application and systems software with all components well integrated and linked over a high-speed network.

Developing VAMPIRE

Since the group had limited funds, financial costs played a major role in hardware and software decisions. All hardware, including the central processing units (CPUs), hard drives, and networking cards, was purchased on the Internet at the lowest prices available. From the beginning, they knew that they wanted to use Linux for the operating system, but deciding on the specific build and distribution took some time and effort. Information technology services (ITS), the campus agency responsible for overseeing the information infrastructure of the university as a whole, provided much helpful assistance by supplying personnel support for this decision and other technical details. Another software issue was providing a mechanism to share the resource effectively with many users. Two different packages, MAUI6 and OpenPBS,7 were used. Both of these are freely available HPC cluster resource management and scheduling systems. Unfortunately, these options did not provide all the functionality that was needed and were not well supported by their developers. However, the fact that the software was free outweighed the shortcomings.

While Jason and Paul were the leaders in making these types of decisions, Alan Tackett8 played a critical role in the technical development of VAMPIRE. Alan’s research background is computational physics, and he came to Vanderbilt in 1998 as a postdoctoral research fellow in physics. In 1999, he heard about the VAMPIRE effort getting under way and became involved. With previous parallel computing experience, he ultimately took the lead on technical details. He was instrumental in providing technical expertise and input for key hardware and software decisions. Another contribution Alan made was leading the outreach efforts to attract new investigators. He routinely met with research groups, learned about their work, and explained to them how a parallel computer could aid them in their research.

Finding physical space for the VAMPIRE system was not a difficult task. ITS, besides providing helpful input, volunteered space in one of its raised-floor air-conditioned rooms within the Hill Center, where ITS was located. Jason notes that “ITS was instrumental throughout the whole process. Having a group on campus that was willing to support us with space and resources was key. ITS was incredibly helpful. If ITS hadn’t been involved, space would have been a bigger issue.”

In the spring of 2000, the group, with the help of graduate students and postdoctoral research fellows, assembled VAMPIRE. A 2-day pizza party coincided with the activities. It took the group about 48 hours to assemble the parallel computer, with its fifty-five dual-processor nodes, by hand. Since that time, VAMPIRE has been operational 24 hours a day. There have been some hardware failures such as losing a few CPUs, memory sticks, and hard drives, but these types of issues are expected for a system of this size.

In the beginning, only a few other investigators were involved. They made contributions to the system in exchange for access to VAMPIRE. One such person was Walter Chazin,9 professor of biochemistry and director of the center of structural biology. Another group that was involved early on was the nuclear physics group. The number of investigators started at five in 2000, grew to ten in 2001, and continued to grow to sixteen in 2002. The popularity of VAMPIRE grew as others heard about its usefulness. There was no formal mechanism for attracting other researchers, but word of mouth was particularly effective. Initially, Paul and Jason knew of a few people with whom they wanted to talk, but other researchers simply approached them after hearing about the effort from colleagues. One person who helped publicize VAMPIRE and brought people together was Chip Cox, director of the Vanderbilt Internet 2 project. After the initial setup, additional funding resulted from the participation of new investigators. Two engineering professors contributed a large sum of money. One provided $250,000 as part of his startup funds, and another $250,000 came from a U.S. Navy grant. At this time, Ron Schrimpf10 joined the effort and would play a large role in the further maturation of VAMPIRE into a larger system. Ron, a professor from the Department of Electrical Engineering, contributed a large number of nodes for VAMPIRE through one of his department’s research programs. His research deals with the interface of physics and the semiconductor aspects of electrical engineering. He requires the use of heavy-duty computing for simulations, and his role represents the perspective of a major user of the system.

Growing VAMPIRE into the Scientific Computing Center

The success of VAMPIRE alleviated many initial concerns about its viability. There were questions about whether different research cultures would clash, whether they could all agree on hardware and software decisions, whether it was possible to create a fair sharing mechanism for all users, and whether there would be synergy among the users. VAMPIRE proved that all these concerns could be handled. Jason believes that VAMPIRE was key in making the idea of an even larger computing facility seem feasible: “VAMPIRE was critical because it showed that an interdisciplinary team of investigators from across the entire university could come together and work on a project. It brought the School of Medicine, School of Arts and Sciences, and School of Engineering together on a single project. It got us talking to one another. That in and of itself is a tremendous achievement for the university. . . . VAMPIRE provided a focal point for bringing together investigators. It was a successful pilot project that showed that we could all work together towards a common goal.” Building on the achievements of VAMPIRE, Paul, Ron, and Jason developed the idea for a scientific computing center (SCC). It would not merely be a larger system accommodating more users but would also entail educational outreach efforts to introduce inexperienced users to the world of HPC.

Paul agrees that VAMPIRE was essential to the development of a more comprehensive computing center for the university: “In our minds, we were going to see how this [VAMPIRE] went. This was a test case to see if we could work together. Always in the back of my mind, I knew that I was going to need significantly more computing. There was never any question in my mind that I was going to have to find some way to get it. Exactly how much wasn’t clear. Once we got things together and working and moving forward, we realized that we could work together, and it was a great idea. It always seemed to us that we were going to grow. There was talk very early on of a large system. It wasn’t the SCC, but there was talk of a large facility. The SCC and its idea developed and grew over time.”

Another important lesson learned from VAMPIRE was that education outreach and the recruitment of new users were possible. Paul emphasized this point: “We realized that the whole education outreach efforts and low barriers to participation were possible. For example, Alan interacted with other research groups on campus, talking with them, and working with them. He also taught a class11 with Greg Walker12 from engineering about methods of parallelizing applications. . . . We realized that there was a lot of interest on campus. There weren’t just going to be a few dedicated computer nerds using it. There were a lot of people on campus who could benefit from this with a little bit of help.” Ultimately, the SCC would not merely cater to a few users but would aim to serve the university community as a whole. In Jason’s words, “We wanted to set up a center that will span the entire university and reaches out to all people doing computational work in every department. We eventually hope to get people from music, law, and business using the system.”

Obtaining Funding

Once the concept of the SCC was developed, the next step toward making it a reality was to secure funding. However, initial attempts to find funding were unsuccessful. Two requests were made to the NSF through its major research instrumentation13 (MRI) program. This program aims to increase the scientific and engineering equipment for research by supporting large-scale instrumentation investments. Awards typically range between $70,000 and $140,000. Both applications for the SCC asked for $1.5 million but barely missed approval. In addition, Jason in 2001 submitted an application to the high-end instrumentation program14 of the National Institutes of Health (NIH). It received good scores and good reviews, but it did not get approval for the $1.5 million amount that he requested.

Besides external federal funding, internal funding through the university was possible. The university’s Academic Venture Capital Fund15 (AVCF) was established to launch major new transinstitutional initiatives in order to advance Vanderbilt to the front rank of American research universities. The application process required submission to at least one of two strategic academic planning groups (SAPGs), which included one for the medical center and one for the university central. In the event that a proposal involved both the medical center and the university, simultaneous consideration would be conducted by both SAPGs, and this was the case with the SCC proposal. If SAPG approval is given, proposals are forwarded to the integrated financial planning (IFP) council for further consideration, and the final step for approval is a recommendation to the university chancellor for funding. One of the central requirements for a successful proposal was for it to satisfy a set of ten prespecified selection criteria including the following:

1. The proposed effort is in accord with the Vanderbilt University chancellor’s five basic goals for academic excellence and strategic growth:

• We must renew our commitment to the undergraduate experience at Vanderbilt.

• We must reinvent graduate education at Vanderbilt.

• We must reintegrate professional education with the intellectual life of the university.

• We must reexamine and restructure economic models for the university.

• We must renew Vanderbilt’s covenant with the community.

2. The proposed effort will help advance Vanderbilt to the front rank of American universities. To offer only two examples, this could be accomplished by bringing together existing institutional strengths in a new and distinctive way, or by proposing a creative way to strengthen a critical area that limits Vanderbilt’s ability to move forward.

3. The proposed effort enhances the learning environment and opportunities for undergraduate, professional, and graduate students and recognizes the need to recruit and retain an intellectually, racially, and culturally diverse campus community.

4. The proposed effort will require a significant investment in graduate education, and, if successful, will improve the national ranking of one or more graduate programs.

5. The proposed effort involves a broad range of faculty rather than a few individuals and will foster greater collaboration among the schools.

6. The proposed effort will strengthen disciplinary integrity and expand the interdisciplinary range of departments.

7. The faculty leadership is already in place.

8. The proposed investment will strengthen the core disciplines.

9. The proposed effort is bold, requiring significant intellectual and financial investment, with anticipated gains commensurate with the magnitude of the investment.

10. The proposed effort shows clear promise for generating the funding needed to sustain itself after the initial period of AVCF support (of no more than 5 years).

In 2002, the first proposal was submitted to the AVCF but did not receive approval. It was an administration-driven effort led by the director of ITS at the time. Then in mid-2003, a second proposal spearheaded by Jason, Paul, and Ron was submitted to the AVCF and received approximately $8.2 million in funding for the SCC. One important distinction to note between the different sources of funding is that federal funding would have provided means solely for building the computer. It would not have covered any other aspects of the SCC. On the other hand, the internal AVCF funding provided capital for data storage, data archiving, data visualization, and personnel related to outreach and support efforts.

Details About the SCC

The approved AVCF proposal explicitly laid out the administrative and organizational structure for the SCC. Jason, Paul, and Ron are the principal investigators and make up the steering committee, while Alan serves as project administrator. The steering committee is responsible for all major decisions but will seek input from other committees. There are four other committees:

1. The investigators committee consists of all faculty members who are using or will be using the system. It currently contains approximately fifty investigators from the university. This internal advisory committee is chaired by Walter Chazin, Peter Cummings16 (chemical engineering), Mark Magnuson17 (molecular physiology and biophysics, assistant vice chancellor for research), and Nancy Lorenzi18 (biomedical informatics, assistant vice chancellor for health affairs), and it will provide a diverse array of opinions.

2. The external advisory committee consists of three or four individuals from outside Vanderbilt in order to provide an objective perspective.

3. The technical advisory committee, chaired by Jarrod Smith19 from the Department of Biochemistry, will make hardware and software recommendations.


4. The users committee will communicate the needs of the daily users such as graduate students. It is chaired by Greg Walker, a professor from the Department of Mechanical Engineering.

There is an organized reporting structure in place to facilitate communication between committees. Alan, the project administrator, submits quarterly reports to the steering committee. The technical advisory committee provides an evaluation of current operations as well as recommendations for future infrastructure through quarterly reports to the steering committee. The external advisory committee provides biannual reviews to the steering and investigators committees. The steering committee submits annual reports to the investigators committee for approval, and it also provides the annual report to Dennis Hall,20 associate provost for research, and Lee Limbird,21 associate vice chancellor for research.

Within the proposal, three types of targeted users are enumerated:

1. Experienced investigators who use parallel computing regularly will be able to immediately take advantage of the center.

2. Users who regularly do computing but have never had the resources to do parallel computing. These users know about parallel computing but have never had the opportunity to take advantage of it.

3. People who do not know about parallel computing and are not aware that it can help them in their research are still able to use the resources.

In order to aid the second and third types of users, the SCC will employ an education and outreach staff. Informational and tutorial sessions will provide assistance to researchers on how to take advantage of HPC. In Ron’s opinion, “the educational activities in a way are more important [than the computer]. We’re going to have hardware, and we need hardware. If it sits there by itself without anyone helping new people use it, it’s not going to have a big impact on the culture of the campus. What will really be transformative about the center is the other side of it [education outreach], which will help people get involved.” In addition to the outreach staff, there will also be an operations staff, which will maintain the hardware and software resources, and a scientific staff, which will include visiting scholars and center fellows.

By identifying the three types of potential users, the SCC emphasizes catering to the needs of researchers. One of the core philosophies of the center is that it is an investigator-driven resource. Jason believes that this idea is central to the ultimate success of the SCC: “From the start, this has been a grassroots effort. This has been an investigator initiated project. We said we needed this resource, and we’re going to put together the funds to get it started. This was our project. We started it, we organized it, we put it together, and we made it work. Our philosophy is that nobody knows better what we need for our research than us. It’s going to be a center run by the investigators for the investigators.” Paul echoes this sentiment: “I don’t think it makes sense any other way. Investigators are the ones with the stake in it and motivation to make it work. . . . I think the day it stops being that is the day it starts falling apart.”

In order to allow simultaneous use by many people, the SCC follows a relatively straightforward sharing mechanism. Investigators gain access by contributing resources, such as CPUs, to the center. The use of these resources is guaranteed to them whenever they want them. However, people do not use their resources all the time. Consequently, the pool of excess resources can be split among all other users. So far, this arrangement has worked smoothly. The beauty of this simple agreement can be summarized in Paul’s words, “You can buy thirty machines and get access to a thousand machines.”
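The sharing rule described above can be sketched in a few lines of Python. This is only an illustration, not the scheduler configuration the SCC actually runs: the function name `allocate`, the investigator labels, and the even division of idle CPUs among users who want more are all assumptions, since the text says only that excess resources "can be split among all other users."

```python
def allocate(contributed, requested):
    """Grant CPUs under a contribute-to-share policy (illustrative only).

    contributed: dict mapping investigator -> CPUs they added to the pool
    requested:   dict mapping investigator -> CPUs they want right now
    Returns a dict mapping investigator -> CPUs granted.
    """
    # Step 1: every investigator is guaranteed up to what they contributed.
    granted = {
        name: min(requested.get(name, 0), owned)
        for name, owned in contributed.items()
    }
    # Whatever the owners are not using right now forms the spare pool.
    spare = sum(contributed.values()) - sum(granted.values())
    unmet = {name: requested.get(name, 0) - granted[name] for name in contributed}

    # Step 2 (assumed policy): divide the spare pool evenly among
    # investigators who still want more, repeating until the pool is empty
    # or every request is satisfied.
    while spare > 0:
        waiting = [name for name in sorted(unmet) if unmet[name] > 0]
        if not waiting:
            break
        share = max(1, spare // len(waiting))
        for name in waiting:
            extra = min(share, unmet[name], spare)
            granted[name] += extra
            unmet[name] -= extra
            spare -= extra
            if spare == 0:
                break
    return granted


if __name__ == "__main__":
    # Hypothetical pool of 1,000 CPUs; "small_group" owns only thirty of them.
    pool = {"small_group": 30, "group_b": 500, "group_c": 470}
    print(allocate(pool, {"small_group": 1000, "group_b": 0, "group_c": 0}))
```

With the other groups idle, the hypothetical thirty-CPU contributor is granted the whole pool, which is the situation Paul's remark describes. As soon as the other groups ask for their own CPUs, their guaranteed share is served first and the small group falls back toward its thirty.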

Current State

At this point, the SCC has not yet grown to its full size. It currently contains 400 processors and ranks as number 199 among the top national HPC clusters.22 Eventually, the SCC will possess 2,000 processors or 1,000 dual-processor nodes. The first major hardware purchases will occur in January or early 2004, and there is a rolling schedule for hardware purchases. Each year, one third of the processors will be added, so the system will not reach full capacity for 3 years. Afterward, the oldest third of the nodes will be replaced each year because the processors typically have a 3-year life cycle before becoming obsolete. While VAMPIRE originally consisted of commodity-priced parts assembled by the group, the SCC will purchase hardware from a third-party vendor who will assemble and test the system.
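The rolling purchase schedule can be made concrete with a short simulation. The sketch below is a simplified illustration under stated assumptions: it ignores the 400 processors already installed, treats 2004 as the first purchase year, and rounds the thirds of 2,000 to whole processors. The function name `rolling_schedule` and its parameters are hypothetical.

```python
from collections import deque


def rolling_schedule(target=2000, phases=3, years=6, first_year=2004):
    """Simulate buying one third of the target each year, then replacing
    the oldest third annually (illustrative only).

    Returns a list of (year, purchased, retired, total_installed) tuples.
    """
    # Split the target into `phases` batches that sum exactly to it,
    # e.g. 2000 -> [667, 667, 666].
    base, extra = divmod(target, phases)
    sizes = [base + (1 if i < extra else 0) for i in range(phases)]

    cohorts = deque()  # (purchase_year, size), oldest cohort first
    rows = []
    for i in range(years):
        year = first_year + i
        retired = 0
        if len(cohorts) == phases:
            # Full capacity reached: retire the oldest cohort and
            # buy a replacement of the same size.
            retired = cohorts.popleft()[1]
            purchased = retired
        else:
            # Ramp-up phase: add the next third.
            purchased = sizes[len(cohorts)]
        cohorts.append((year, purchased))
        rows.append((year, purchased, retired, sum(s for _, s in cohorts)))
    return rows


if __name__ == "__main__":
    for year, purchased, retired, total in rolling_schedule():
        print(f"{year}: buy {purchased}, retire {retired}, installed {total}")
```

Under these assumptions the installed count reaches 2,000 with the third purchase and stays there, while no processor remains in service for more than three years.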

Besides the processors, supporting infrastructure was another consideration of the proposed budget. The groundwork has been laid for a large tape archive facility with a $75,000 tape library purchase. A disk storage system has been chosen that can sufficiently handle the large amounts of data that will be generated. It will be flexible enough to handle growing user needs. Furthermore, the budget allocated funds for specialized visualization hardware that will enable real-time analysis of large, complex data sets with immersive display technologies.

The SCC’s budget calls for an initial large investment in equipment. In subsequent years, the funds will shift a greater percentage to personnel and will reach a steady state of personnel and equipment costs. After 5 years, the center hopes to be able to sustain itself financially because the AVCF provides funding for a maximum of 5 years. To reach this end, the steering committee plans to hire a financial director in the beginning of 2004. The director’s responsibilities will include overseeing the finances as well as driving the outreach efforts. The ideal candidate will have management, financial, and accounting experience. In Jason’s opinion, the financial independence of the SCC will be the biggest challenge to the center. He realizes that this will require much effort but is optimistic: “I think it will work since there are so many people at the university who will use the center. There will be a lot of funding coming into the center, and we should be able to recover most of the costs to keep it going.”

Another major consideration for the future is how to accommodate the needs of so many users. When VAMPIRE was in its beginning stages, involving only a few investigators, Paul recalls that the small group had good communication and a loose organization. They saw eye to eye on most issues. However, as the SCC grows larger, decisions become more complicated: “When everything was small and friendly, it was easy. Now, it has to be big and professional. We have to work for a lot of people. In some cases, we have competing needs and issues. What do we do first [in ramping up the system]? What do we emphasize? Should we spend the personnel and resources this way or another way?” One current example of varying needs of users is the following: One individual requires each CPU that he utilizes to have 4 gigabytes of random access memory (RAM) instead of the customary 1 gigabyte. According to Jason, the steering committee, with input from the technical advisory committee, must answer such questions as, “Do we want to have every node have 4 gigabytes? Is it cost-effective to do that? If it’s too expensive, can we have 10 percent of the nodes have 4 gigabytes? How feasible is it to have one part of the system have an increased amount of RAM?”

So far, the SCC has been able to bring together a diverse community. An increased rate of scientific discovery by university researchers should follow because previously prohibitive computational work is now feasible. Other anticipated benefits include enhanced education for students and a recruitment tool for new faculty. Those who played a major role in its development undoubtedly have learned many lessons along the way. Paul admits that “there [are] a lot of little things that I would have liked to have done differently. I wish I’d understood better that tape systems are such a headache, but it wouldn’t have mattered since there would have been other technical issues. I wish that we had all understood better how best to get this project going. . . . This was the first time I ever took on a project of this magnitude. You learn things about management along the way . . . how to handle the people involved in the project and the people who will benefit from the project.”

However, future unanticipated obstacles may arise because the SCC is still maturing and has yet to achieve its envisioned size. The steering committee is well aware of the fact that what works on a 55-node cluster will not necessarily work on a 1,000-node cluster. Paul notes that “If you have 100 different groups, each may be able to contribute in only a special way since NSF [or whichever funding agency] says that they can only spend the money a certain way. We have this infrastructure we have to pay for, and we have to somehow find a way to allocate it back to the users. We’ve been successful so far, and everybody’s been happy.”


1. The initial phases of development with VAMPIRE went relatively smoothly. What factors contributed to this success?

2. What were key factors in making an interdisciplinary project of this size work?

3. Were there any decisions that you would have made differently?

4. Are there any potential issues you believe that the steering committee may not have considered?

5. If another university were planning to set up a similar computing center, what are the most important lessons that they should learn from the Vanderbilt University example?


1. http://phg.mc.vanderbilt.edu/jason.shtml.

2. http://medschool.mc.vanderbilt.edu/oor/pd/index.php?PD=4.

3. http://www.hep.vanderbilt.edu/~sheldon/.

4. http://www.vupac.vanderbilt.edu/.

5. http://vampire.vanderbilt.edu/.

6. http://mauischeduler.sourceforge.net/.

7. http://www.supercluster.org/projects/pbs/.

8. http://vampire.vanderbilt.edu/staff.php#tacketar.

9. http://structbio.vanderbilt.edu/chazin/.

10. http://www.vuse.vanderbilt.edu/~schrimpf/persinfo.html.

11. http://tplab.vuse.vanderbilt.edu/~walkerdg/hpc.html.

12. http://frontweb.vuse.vanderbilt.edu/vuse_web/directory/facultybio.asp?FacultyID=341.


13. http://www.eng.nsf.gov/mri/.

14. http://grants1.nih.gov/grants/guide/rfa-files/RFA-RR-03-009.html.

15. http://medschool.mc.vanderbilt.edu/oor/pd/doc/AVCF_Guidelines_2002_03.doc.

16. http://www.vuse.vanderbilt.edu/~cheinfo/cummings1.htm.

17. http://www.mc.vanderbilt.edu/vumcdept/mpb/magnuson/.

18. http://www.mc.vanderbilt.edu/dbmi/people/faculty/lorenzi_nancy/index.html.

19. http://structbio.vanderbilt.edu/~jsmith/home.html.

20. http://www.physics.vanderbilt.edu/cv/dghall.html.

21. http://medschool.mc.vanderbilt.edu/limbirdlab/.

22. http://www.top500.org/.



