Technical Brief: Specifying a PC for Mascot

Matrix Science
8 Wyndham Place
London W1H 1PP
United Kingdom
Tel: +44 (0)20 7723 2142
Fax: +44 (0)20 7725 9360
info@matrixscience.com
http://www.matrixscience.com
Contents

Introduction
Processor (CPU)
Random Access Memory (RAM)
Hard Disk Storage
Operating System
Web Server Software
Mascot Cluster Mode

Introduction

Any recent, high-specification PC containing either Intel or AMD processor(s) should make a suitable platform for Mascot. If you are buying a new PC, then a dual-processor system, or one which can be upgraded to dual processors, will be a good investment. Systems with more than two processors usually carry a substantial price premium. If you plan to do high-throughput work, and need to run Mascot on more than two processors, a cluster of dual-processor boxes will usually offer the most cost-effective solution.

Processor (CPU)

Two conditions must be satisfied in order for search speed to be proportional to processor speed. First, the FASTA sequence databases must be memory mapped, as discussed below. Second, the processor cache must be large enough and fast enough to prevent the limited bandwidth between processor and memory becoming a bottleneck. This second factor is critical because, even though the processor may be running at 1 GHz or more, the memory bus will normally be running at 133 MHz or less. The processors mentioned below all have adequate cache provision, particularly the Pentium Xeon variants, which are available with a choice of cache sizes.

The variety of processors on the market, and the rate at which new models are introduced, make it difficult to give specific recommendations. From Intel, the Coppermine Pentium III, Pentium III Xeon, Pentium 4, and Pentium Xeon families are all suitable choices. From AMD, the Athlon is known to perform well.

With new or unusual processors, operating system compatibility can be an issue. Hardware compatibility lists for Windows 2000 and Red Hat Linux can be found here:

http://www.microsoft.com/windows2000/professional/howtobuy/upgrading/compat/default.asp
http://hardware.redhat.com/hcl/genpage2.cgi

© 2001 Matrix Science Ltd.
Not all processors are compatible with multiprocessor operation, and the operating system may also impose restrictions. The Intel Pentium III supports dual-processor operation, but going beyond this requires a Xeon variant. At the time of writing, it appears that Intel's marketing strategy is to restrict the standard Pentium 4 to single-processor usage, and to promote the Xeon variants (called simply Pentium Xeon) for multiprocessor applications.

You should recognise that very little commercial software actually uses multiple processors, and relatively few 4-way and 8-way systems are in use, so there is a greater chance of encountering hardware and operating system problems with multiprocessor boards than with single-processor boards.

We have observed excellent scalability with dual Pentium III processors running under Microsoft Windows 2000, NT 4, and Linux. That is, throughput from a dual-processor system comes very close to double that obtained from a single processor. However, we cannot predict or guarantee the scalability of Mascot on hardware configurations that have not been specifically tested.

Random Access Memory (RAM)

RAM requirements are strongly dependent on the selection of databases you plan to search. Mascot Monitor makes a compressed copy of each FASTA database, in which the title lines have been removed and the sequence strings have been packed in a byte-efficient manner. The compressed copy of each database is mapped into RAM and, if there is sufficient room, can be locked in place.

When a search calls for a database that is not in memory, the search duration is increased by the time taken to read the database from disk. For a long search, such as a no-enzyme specificity search of a large LC-MS/MS dataset, this additional time may be negligible. For a short search, reading from disk may take longer than the search itself.

Databases should always be memory mapped, even though a system might not have sufficient physical RAM to hold them all.
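To make the idea of memory mapping concrete, the following Python sketch maps a file read-only and scans it through the mapping. It is a generic illustration only, not Mascot's implementation: the helper names (write_demo_file, count_residue) and the tiny demo sequence are invented for the example, and locking pages into RAM is not shown.

```python
import mmap
import os
import tempfile


def write_demo_file(data):
    """Write bytes to a new temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def count_residue(path, residue):
    """Count occurrences of a byte string in a file via a read-only memory map."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file. Mapping reserves address space only;
        # the operating system reads pages from disk as they are touched.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            pos = mapped.find(residue)
            while pos != -1:
                count += 1
                pos = mapped.find(residue, pos + 1)
            return count


# Hypothetical demo data: a short made-up sequence string.
demo_path = write_demo_file(b"MKTAYIAKQR")
print(count_residue(demo_path, b"K"))
os.remove(demo_path)
```

The code addresses the mapped file like an in-memory byte string, while the operating system decides which pages are actually resident in RAM at any moment.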
Memory mapping only consumes virtual address space, and enables the file to be accessed more efficiently. However, it doesn't guarantee that a particular database will be in memory when a search calls for it; some other process may have displaced it. So, the smaller, frequently searched databases should be locked into memory, guaranteeing that they are always loaded in RAM.

RAM requirements can be estimated from the sizes of the FASTA files you intend to lock in memory. For a protein database, the required RAM is roughly 80% of the FASTA file size, while for a nucleic acid database it is roughly 40%. Some examples are given in the following table, but note that the comprehensive sequence databases increase significantly in size every month.

Database      FASTA (Mb)   RAM (Mb)   Compression
Swiss-Prot           104         86   1 : 0.82
MSDB                 272        220   1 : 0.81
dbEST               5429       2115   1 : 0.39

You also need to allow approximately 60 Mb for the operating system (Windows) and some 10 Mb for each executing Mascot search. So, for a single non-redundant protein database,
512 Mb RAM is sufficient. To have Swiss-Prot, MSDB and dbEST, plus a few smaller databases, locked in memory at the same time requires at least 2.5 Gb. Since many PC motherboards only support a maximum of 1 or 2 Gb RAM, this looks like a problem. In practice, however, it is rarely necessary for a database as large as dbEST to be locked in memory. Being composed of short stretches of nucleic acid sequence, it is not suitable for peptide mass fingerprint searches, and tends to be used as a database of last resort for large searches, where the overhead of reading it from disk represents only a small part of the total search time.

Hard Disk Storage

The Mascot program files require very little disk space in comparison to the sequence databases and the accumulating result files. For the sequence databases, you will need to maintain free disk space of the order of 3 times the largest FASTA file. This is because, during a database update, the disk may hold the current FASTA file and its associated compressed files plus the equivalent files for the incoming database.

The space needed for result files depends on the overall search profile and on how long results are to remain on-line. Individual result file sizes range from 20 kb for a peptide mass fingerprint search through to several Mb for a large LC-MS/MS dataset.

Disk drives are very inexpensive, and most PCs support up to four IDE devices. It is difficult to have too much disk space, especially if you plan to search databases similar in size to dbEST.

If any databases are not memory mapped, short searches may be disk I/O bound, and a fast disk (e.g. fast wide SCSI) or a disk array (e.g. RAID) can then become an important factor in maximising throughput.

Operating System

Supported operating systems for Mascot on Intel are:

Operating System                                    Max. CPU
Microsoft Windows XP Professional                   2
Microsoft Windows 2000 Professional                 2
Microsoft Windows 2000 Server                       4
Microsoft Windows 2000 Advanced Server              8
Microsoft Windows 2000 Data Center                  32
Microsoft Windows NT4 Workstation                   2
Microsoft Windows NT4 Server                        4
Microsoft Windows 2000 Enterprise Server            8
Red Hat Linux 7.1, kernel version 2.4.2 or later    N/A
Web Server Software

Mascot requires a web server for administration and interactive use. In the case of Windows, Microsoft's Internet Information Server (IIS) is the obvious choice unless you are committed to some other package. IIS is bundled with Windows 2000 and included in Option Pack 4 for NT. The Mascot installation program automatically configures IIS versions 4 and later. If you decide to use a web server other than IIS, some manual configuration will be required.

The web server provided with NT4 Workstation is called Microsoft Personal Web Server (PWS). This server is very similar to IIS, but differs in a few key features, such as the maximum number of simultaneous connections. Full details can be found at:

http://www.microsoft.com/ntworkstation/news/mktbulletins/ntwvnts.asp

Apache is a good choice for Linux. It can also be used under Windows, but the current Windows version doesn't support non-parsed headers, which prevents the display of progress reports during a search.

http://www.apache.org

Running a web browser on the same PC as the web server can take a surprising amount of processor time, so search times may suffer. If the same PC is also used for instrument control and data acquisition, you may need to adjust job priorities using Windows Task Manager to ensure that the instrument gets adequate priority.

Mascot Cluster Mode

A Mascot licence for 4 or more processors automatically supports operation on a cluster of systems connected by a dedicated 100Base-T LAN. A cluster offers several advantages over a single, multiprocessor system:

- Mass-market, reliable, low-cost PC hardware can be used.
- The cluster can be incrementally expanded as workload increases.
- The RAM required to map sequence databases is distributed across multiple systems, circumventing the limits of a single system.
- The limited bandwidth of the PC bus is effectively multiplied by the number of systems in the cluster.
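The sizing rules of thumb given earlier in this brief can be combined in a short calculation: roughly 80% of the FASTA size for a locked protein database, roughly 40% for a nucleic acid database, about 60 Mb for the operating system, about 10 Mb per executing search, and free disk space of the order of three times the largest FASTA file. The Python sketch below is illustrative only; the function and constant names are hypothetical and are not part of Mascot, and the results are estimates rather than guaranteed requirements.

```python
# Rule-of-thumb figures quoted in this brief (sizes in Mb, as used in the text).
PROTEIN_FRACTION = 0.80   # compressed protein database / FASTA size
NUCLEIC_FRACTION = 0.40   # compressed nucleic acid database / FASTA size
OS_OVERHEAD_MB = 60       # approximate allowance for the operating system
PER_SEARCH_MB = 10        # approximate allowance per executing search
DISK_MULTIPLE = 3         # free disk space relative to the largest FASTA file


def estimate_ram_mb(databases, concurrent_searches=1):
    """Estimate the RAM (Mb) needed to lock the given databases in memory.

    `databases` is a list of (fasta_size_mb, kind) tuples, where kind is
    either "protein" or "nucleic".
    """
    fractions = {"protein": PROTEIN_FRACTION, "nucleic": NUCLEIC_FRACTION}
    locked = sum(size * fractions[kind] for size, kind in databases)
    return locked + OS_OVERHEAD_MB + PER_SEARCH_MB * concurrent_searches


def estimate_free_disk_mb(databases):
    """Estimate the free disk space (Mb) to keep available for database updates."""
    return DISK_MULTIPLE * max(size for size, _ in databases)


# FASTA sizes taken from the table in the RAM section.
example = [(104, "protein"), (272, "protein"), (5429, "nucleic")]
print(f"RAM:  {estimate_ram_mb(example):.0f} Mb")
print(f"Disk: {estimate_free_disk_mb(example):.0f} Mb")
```

For the three databases in the table, the RAM estimate comes to roughly 2.5 Gb with one search running, in line with the figure quoted in the RAM section. Actual compressed sizes differ slightly from the rule of thumb, as the Compression column of the table shows.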