European Commission to coordinate supercomputing developments to bring Europe back in the race - Software is the key

10 Oct 2011 Lyon - IDC just finished a formal study on the parallel software landscape in Europe with recommendations for supporting and advancing parallel software for the European Commission. We discussed the findings with IDC's Steve Conway.

Primeur magazine:So this was the second report after the one from last year that was more on the HPC landscape in general?

Steve Conway:The one from last year was more holistic. It looked at hardware, software, networking, everything around HPC. It looked at science and industry. But one of the recommendations from it was that a lot of money should be put into software development, because especially for the Horizon 2020 perspective it was clear that software advances would be much more important for HPC leadership than hardware advances.

Primeur magazine:Is that because we cannot build fast supercomputers in Europe?

Steve Conway:Because the hardware is already light years ahead of the software at least in peak performance. The software is not able to efficiently exploit even a small fraction of the hardware.

Primeur magazine:I did hear a lot of people say, during the past year that we should have co-design. So design the parallel applications and the software together with the hardware. So how does that fit into this picture?

Steve Conway:That is an interesting question: co-design. If you are talking about co-design for some of these big systems, you are talking about half a dozen software codes in, say, quantum chromodynamics and related fields. That is essentially what has been happening until now. When those really big systems are designed to run a handful of specific scientific codes, co-design makes a lot of sense. But in fact, on many of those systems hundreds of different kinds of applications are being run. Co-designing for optimal performance on hundreds of different applications in hundreds of different fields, which is what part of the workload looks like in many of these centres, is not very practical.

The real question that you raise with the co-design issue is: are these largest supercomputers inevitably doomed to be very narrow-purpose machines? If you listen to Thomas Schulthess from CSCS - whom I agree with - he says that with each succeeding generation of very high-end supercomputers, the breadth of applicability has narrowed. So they are getting narrower and narrower, which makes sense, because the more highly parallel the machines get, the more difficult it is to exploit them.

Primeur magazine:But on the other hand, people like Satoshi Matsuoka and Thomas Sterling say that they think they can design machines through co-design that still have some general applicability.

Steve Conway:In essence, and Thomas was at the session where I was, you are really co-designing applications that are in most cases humiliatingly parallel to begin with. That is not to detract from the difficulty of making applications, even very highly parallel applications like those in QCD, run on and exploit a large fraction of a big machine. That is difficult and praiseworthy to do, but co-designing applications that inherently have a very parallel nature does not necessarily significantly advance the hardware state of the art.

Primeur magazine:But does that also mean we should forget about a European supercomputer being among the first ten or even among the first hundred supercomputers in the world? Because that is narrow-minded supercomputing?

Steve Conway:No, not at all. People in all fields, including engineering, need access to the most powerful supercomputers that they can have. That holds even for industrial applications that typically do not need a big supercomputer. Even the ISVs in the industrial fields regularly need access to more computing power than they have. As part of a study maybe three or four years ago, we asked industrial users whether they could really make use of Petascale machines, and a very high percentage of them said "yes, as long as we can get the applications to scale better, we certainly could, because the problems themselves need that kind of power". Unfortunately, the software is not ready to handle that kind of machine.

A lot of the industrial codes and a lot of the scientific codes were originally written to exploit one processor with access to central memory and vector processors. They have not been deeply rethought, not deeply rewritten since that time.

Primeur magazine:So the idea of having more parallel programming support is to broaden that and to help those kinds of applications?

Steve Conway:Not just the scientific applications. But also the important industrial applications, which are really much more stuck in many cases at lower scalability levels than the scientific ones.

Primeur magazine:What will that support look like? Or how do you recommend it should look?

Steve Conway:There should be a model that starts with the notion that software development is typically funded nowadays, not just in Europe, but also elsewhere for two or three years. But it really needs to be funded for at least ten to fifteen years if you are going to create software that is robust enough to last, as it needs to, through multiple generations of hardware that change during that time. It is a lot easier to develop a new generation of hardware than software. You do not want to be designing software that you have to redesign every five years. That is not a good idea.

Primeur magazine:So you could say it should follow the example of LINPACK for instance. The LINPACK Library was also developed over a very long period.

Steve Conway:It was, and if you remember, LINPACK broke a few years ago, when Jack Dongarra was hired back in by the American government to fix it. LINPACK depends partially on a random number generator that did not scale past a certain point. The machines got bigger. People tried to run LINPACK, but it actually broke and that piece had to be fixed.

So we see that models can break all of the time. So the starting point of that is to have a longer funding period for software development. And then there needs to be a kind of development model: the model that we propose is that there be centres of excellence that are covering specific domains or subdomains of science and engineering. Because that is where a lot of the expertise is. So you could have a particular centre, say CERN, for particle physics. CERN would presumably be the designated centre of excellence for the domain of particle physics. What they would then do under this regime is to create a software development plan. The plan would say: here are the five codes that we are going to tackle in the next five or ten years, and here is the schedule for each domain; here is how many people we would have to use and here is going to be the cost of that. Then funding would be given to CERN from the European Union and the Member States on a fifty-fifty basis under the scheme - and significant funding to make that happen.

They would be in charge; they would be shepherding the domain. In some other cases it might be a subdomain. For example, in automotive engineering, Stuttgart has expertise in some pieces, so they might have those pieces to shepherd on behalf of Europe, and some other centre in France might have another piece of it.

So that is the top-down approach, but then we are also suggesting what we called "tiger teams" or "SWAT teams". There will be a couple of hundred of these two-person teams: a domain expert and a parallel programming expert, who go forth in the community and help scientists develop their software on a short-term basis. They would be there for a week to do this, and then if it needs a longer period, they can send it up to their centre of excellence. So there will be this kind of centre of excellence top-down, and then bottom-up with these teams.

And that is not enough: it also has to be bridged to broader dissemination of the software. Currently the pattern that exists in Europe is that there are all these great codes that are used by only one or a few academic institutions. After all the time that has gone into software development, the software is not broadly disseminated. So something needs to happen there. What we are suggesting is that there should be a kind of storefront site - a very serious storefront site - an exchange, where people can make the software available more broadly. This could be (partially) on a paid basis, whatever model they want to use, and people could also get access there to computing resources to run that or other software on. It could be a place where, for example, venture capitalists can go - they do not have a place today - to look for European software firms who are doing something interesting.

It is kind of like your Facebook friends. Go there and see: this software seems to be really getting popular, we are tracking how many users are downloading, and so forth. They can look at where to make some investments. So there really needs to be a storefront that has this aspect as well.

And for the whole development there needs to be central coordination - and that is not just us talking, that is the European HPC community telling us. There needs to be a central body within the European Commission itself, so not only PRACE, that has a coordinating function. It should not get in the way of the great momentum that PRACE and other organisations have established, but work closely with them and really drive things forward. PRACE's mission is broadening. It was originally a provisioning mission.

Primeur magazine:Why inside the Commission, and not at a software institute, or whatever?

Steve Conway:Because it has to not only coordinate software, but also hardware, networking, and if you will, the entire ecosystem. None of those things ought to be done separately. You need a coordinator in some central position. In the first study, that was also the strong consensus from the European HPC community, which includes the scientific and engineering community, not just the computational data centres.

Where should you invest? Not just to catch up, because Europe is significantly behind the US and behind rising Asian countries, but to actually seize global leadership in specific areas of HPC. In a prior study it kind of bubbled up. People had a strong consensus about what the areas are: about half a dozen of them. So if you target global leadership in those areas around the year 2020, you cannot have software going off in one direction and hardware in the other; you really need some type of coordinator.

In the first study, that was also the conclusion. The strong consensus of the HPC community, which includes the scientific and engineering communities and the national funding agencies, not just the computational data centres, was that there needs to be a central European coordinating body, and so people said if it had to be somebody that already exists, then it would be PRACE. But PRACE as it is presently constituted, would not be in a position to do it, because PRACE needs time to evolve from a strong hardware focus to a more holistic focus. Also PRACE was not constituted to drive a leadership strategy for Europe. When I say strategy, the assumption behind this is that the purpose of HPC is not leadership in HPC but leadership in specific areas of Science and Engineering where Europe is already strong today.

Primeur magazine:The Prospect group also had a report. Is that in support of the vision that you presented?

Steve Conway:The reports were generated independently, and the research was done independently. So we really did not work with them, or they with us, but naturally we know the reports, and they have some things in common.

And what we are hearing here in this conference room reinforces what we heard: that software has been the orphan child in all of this.

The hardware roadmap is pretty well set. People are going to make advancements, but the hardware direction is fundamentally set. By that I mean: distributed memory architectures with x86 processors and accelerators. People can use those in different architectural implementations, but essentially what we have got is cluster and cluster-like architectures that are built on x86 and GPU or other accelerator-based processors.

That is what we have got to work with. That is not ideal perhaps. The advantage of that is price-performance; the disadvantage is that it is not as tight a fit for HPC as, say, twenty years ago, when vector processors were designed to be a tight fit with HPC. But they were too expensive, so HPC could not really proliferate based on that narrow definition. What the architecture does today is shift a lot more of the burden onto software, because you essentially are trying to make a washing machine compute. And that is not easy.

The burden has shifted squarely onto software. The model for developing software has really changed. Twenty-five years ago, the vector companies made enough margin when they sold a machine; there was lots of fat in there, so that they could be very highly integrated. They could have their own storage. You know that within Cray Research in its heyday about half of its investment was on the software side. They had their own compiler groups.

Not that some of that does not still exist, but essentially the margins on today's architectures are much thinner. Even if they are IBM or HP, they cannot afford software developers to port and optimize for their machines, or to pay all the scaling and advancement costs, as they used to do heavily. The whole business model has changed, which makes it more and more challenging to move the software forward.

Primeur magazine:Do you think the architecture that you sketched will be here in ten years' time? Because that is what you would need.

Steve Conway:There is not anything on the horizon that would replace the cluster as the dominant species in HPC. When I say cluster, you could say: wait a minute, Cray makes machines that are not clusters. Yes, they are making clusters. They just turbocharged the interconnect and some other components, but in an architectural sense they are clusters.

Primeur magazine:Thanks a lot for this interview.
Ad Emmen