File Handling

The logical and physical organisation of files. Serial and sequential file handling methods. Direct and index sequential files. Creating, reading, writing and deleting records from a variety of file structures. Creating code to carry out the above operations.

Logical File Organisation

A file is logically organised as follows:
- A file is made up of records, and each record contains fields.
- A record is a collection of data that belongs together and consists of a number of fields.
  o Example: all the data about an individual person.
- A field is a data item of a record and usually contains one piece of data, e.g. a date, first name or age.
- Once a program has processed data, the data can be stored permanently, allowing later retrieval.
- This data is stored on magnetic tape (a sequential device), or on magnetic disk (floppy or hard disk) or optical media (CD-ROM, CD-R, CD-RW, DVD etc.), which are direct access devices.

Sequential Device
- Slow.
- Inexpensive.
- Access time is dependent on current position.

Random or Direct Access
- Fast.
- Expensive.
- Have an almost constant access time.

1. Serial Access
- Each record is stored, one after the other, with no regard to any logical order.

- It is the simplest form of file organisation.
- This technique is normally used for storing records for further processing.
- Example of records held in serial order: 001 003 006 004 002 005

Features of Serial Files
1. Easy to implement on magnetic tape.
2. Generally slow access.
3. Usually used for further processing (e.g. sorting) of records.
4. Used mainly as temporary files to store transaction data.
5. Suitable for:
   a. Batch search servicing (group search): a number of requests can be grouped together and processed as a group.
6. Not suitable for:
   a. On-line access, because it is too slow.

2. Sequential Access
- Records are stored one after the other but are sorted using a key sequence.
- Records are kept in some pre-defined order, e.g. names stored alphabetically or records stored numerically.
- Retrieval is achieved by scanning the entries in the same order, e.g. 001, 002, 003, 004, 005 etc., so if we want record number 200 then records 001 to 199 have to be scanned first.
- The principal advantage is rapid access to sets of records, e.g. if the nth record has just been accessed, the (n+1)th record can be accessed very quickly.

Features of Sequential Files
- Records stored in pre-defined order.
- Sequential access to successive records.
- Suited to magnetic tape.
- Maintaining the sequential order makes updating a more complicated and difficult task.
- Records will usually need to be moved by one place in order to add (slot in) a record in the proper sequential order.
- Deleting records will usually require that records be shifted back one place to avoid gaps in the sequence.

- Very useful for transaction processing where the hit rate is very high, e.g. payroll systems or calculating students' grades. When the whole file is processed, this is quick and efficient.
- Access times are still too slow (no better on average than serial) to be useful in on-line applications.

3. Random or Direct Access
- Records are accessed directly, allowing records to be read in any order. For example, to read record 005 you jump directly to it.
- You can read or write information anywhere in the file.
- This implies that the medium being used allows a jump to any point in the file. In practice, this requires some form of magnetic disk storage.
- Direct addressing is not favoured if the demand is primarily for sequential processing.

4. Indexed Sequential
- Similar in principle to the sequential access file.
- However, it is also possible to access records directly by using a separate index file.
- An indexed file system consists of a pair of files:
  o One holding the data.
  o One storing an index to that data.
- The index file stores the addresses of the records stored on the main file.
- There may be more than one index created for a data file.
- Example: a library may have its books stored on computer with indices for:
  o Author
  o Subject
  o Class mark
- There are two types of indexed files:
  o Fully indexed
  o Indexed sequential

Fully Indexed Files
- The index file contains an entry for every single record stored on the main file.
- The records will be indexed on some key, e.g. student number.

- Very large files will have correspondingly large index files.
- The index to a (large) file may be split into different index levels.
- When records are added to such a file, the index (or indices) must also be updated to include their relative position and to change the relative position of any other records involved.

Indexed Sequential Files
- A mixture of sequential and indexed file organisation techniques.
- Records are held in sequential order and can be accessed randomly through an index.
- These files share the merits of both systems, enabling sequential or direct access to the data.
- The index to these files operates by storing the highest record key in given cylinders and tracks.
- Indexed sequential file organisation is very useful where records are often retrieved randomly and are also processed in (sequential) key order.
- Banks may use this organisation for their auto-bank machines, i.e. customers randomly access their accounts throughout the day and at the end of the day the banks can update the whole file sequentially.

Advantages of Indexed Sequential Files
1. Allows records to be accessed directly or sequentially.
2. Direct access ability provides vastly superior (average) access times.

Disadvantages of Indexed Sequential Files
1. The fact that several tables must be stored for the index makes for a considerable storage overhead.
2. As the items are stored in a sequential fashion, this adds complexity to the addition/deletion of records. Because frequent updating can be very inefficient, especially for large files, batch updates are often performed.

5. Hash Coding
- Enables direct retrieval of desired records without the need to search files or indices.
- A hashing algorithm is first applied to one of the keys of the record (e.g. driving licence number, student number or National Insurance number). This converts the key to an address by mathematical or logical calculations.
- Direct addressing is used when records have to be searched frequently in an unpredictable fashion.
- Example: the sale of spare parts in a garage, or the sale of goods in shops, where details about individual items have to be made available simultaneously in a random fashion at many points (e.g. checkout lanes in a supermarket).
- One method is to divide the primary key by a prime number and use the remainder as the address.
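The division method just described can be sketched in Python. This is only a minimal illustration under stated assumptions: the names hash_address, store and retrieve and the sample record text are hypothetical, the divisor 97 and the key 1069 are taken from the worked example in these notes, and collisions are deliberately not handled.

```python
PRIME = 97  # divisor from the notes' example; gives addresses 0 to 96


def hash_address(key, divisor=PRIME):
    """Return the storage location for an integer key (remainder after division)."""
    return key % divisor


def store(slots, key, record):
    """Place a record at its hash address. Assumption: no collision handling."""
    slots[hash_address(key, len(slots))] = (key, record)


def retrieve(slots, key):
    """Recompute the address from the key and return the record, or None."""
    entry = slots[hash_address(key, len(slots))]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None


slots = [None] * PRIME                       # 97 potential record locations
store(slots, 1069, "hypothetical student record")
print(hash_address(1069))                    # 1069 = 11 * 97 + 2, so prints 2
print(retrieve(slots, 1069))
```

Retrieval never scans the file: the key alone determines where to look, which is why no index is needed.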

- Example: suppose we have a fairly large list of students' information, where the student number is the primary key.
- Divide the student number by a prime number, say 97, and use the remainder as the storage location in the file.
- Take student number 1069: 1069 divided by 97 is 11 with a remainder of 2, so 2 is the location of that student's record.
- The remainder will be between 0 and 96. This gives us 97 potential locations for records.
- Once records are stored in this fashion, retrieval simply involves supplying a student number, which will be used by the hashing algorithm to locate the desired student record.

Advantages of Hash Coding
1. Rapid access to records in a direct fashion. It doesn't make use of large index tables and dictionaries and therefore response times are very fast.

Disadvantages of Hash Coding
1. Two keys can sometimes hash to the same address (a collision).
   - Example: for student number 3300, division by 97 also produces a remainder of 2. However, we may already have a record (say 1069) in storage location 2.
   - The extra record will have to be kept in an overflow area.
   - So if hashing produces the same location for more than one record, response time may increase, because the overflow area must be searched whenever the key stored at the hash address does not match the key we are looking for.
2. Storage space can be wasted if there are not enough records to occupy the reserved spaces.
   a. Example: if we are dividing by 97, there should be close to 97 records to go into these predetermined locations. If we choose to divide by 9713, there should be around 9713 records to optimise the use of storage space.

Physical File Organisation
- The information is initially mapped onto the physical blocks, and eventually onto the tracks and sectors of a disk.
- Example: we keep records of students, and each student has a unique identification number that is used as a primary key field, e.g. 10052329.
- Assume that Hope has only 999 students, giving a range of IDs from 001 to 999. The following file will be used to demonstrate the different file organisations:

Student_ID_Number   Student_Surname
001                 George
002                 Hugh
003                 Adams
004                 Murray
005                 Sinclair
006                 Patterson
...                 ...
999                 Cookson

1. Serial File Organisation
- In order to access data within a serial file, a pointer is used.

2. Sequential File Organisation
- Our student file can be sorted by ID and stored on magnetic tape.
- In order to access record 005, Sinclair:
  o The R/W (read/write) head is positioned at the beginning of the file.
  o Records 001 through to 004 are read first.
- If we held 999 student records on file, accessing the last record would take a long time.
- A preferred method would be to implement sequential file organisation on a disk but this is not possible, so direct access would be the preferred method of file storage and retrieval.

3. Random or Direct Access File Organisation
Transferring the above example onto a disk would result in the following:
- Added to the disk is an index, which is loaded into RAM and defines the relationship between the primary key and the corresponding disk address:

Index
         Disk Address
Record   Track   Sector
001      00      0
002      00      1
003      00      2
004      00      3
005      00      4

The index tells the disk R/W head where to look for the data (track and sector). The R/W head goes directly to the correct disk track position, waits for the correct sector to rotate under the head and then retrieves the student's record.

Criteria for selecting file organisation
There are four main criteria to be considered when choosing a file organisation technique:
- File use ratio (hit rate)
- File volatility
- File size
- User requirements

1. File use ratio (file activity)
- File-use ratio (hit rate): the number of records that are accessed (within a specified process or period) divided by the total number of records in the file.
- High file-use ratio: the majority of records are used regularly.

  o Sequential/serial file organisation may be the appropriate method.
- Low file-use ratio (say 5% to 10%): the ability to retrieve a desired record quickly is crucial.
  o Direct file organisation should therefore be recommended.

Examples:
a. Payroll production: an example of a high activity file.
   - An organisation's production of payroll and payslips is a regular event, which can be either weekly or monthly.
   - Such an application requires processing of all or nearly all the employee records.
   - The file-use ratio will be close to or equal to one (100%).
   - Thus sequential file organisation is preferred.
b. Customer accounts in banks: an example of a medium activity file.
   - Both random and sequential access are required.
   - Several customers should be able to withdraw cash simultaneously and randomly.
   - The bank should be able to update all customer accounts periodically by sequential processing.
   - Indexed sequential file organisation may therefore be the most suited to this type of application.
c. Airline ticket reservations: an example of a low activity file.
   - In most cases only one record is accessed at a time.
   - This record is required quickly and therefore direct access is most appropriate.

Calculating the file use ratio
To calculate the file use ratio we need to know the number of records accessed and the number of records in the file.
Examples, for a file that has 8,000 records:
- 250 records are accessed and updated per week: file use ratio = 250 / 8000 = 0.03125 per week (very low).
- 4,100 records are accessed per week: file use ratio = 4100 / 8000 = 0.5125 per week (medium).
- All but 400 are accessed weekly, i.e. 7,600 accessed per week: file use ratio = 7600 / 8000 = 0.95 per week (very high).

2. File volatility
- Volatility describes how often files require modification and updating, e.g. insertions and deletions.
- Highly volatile files: direct access is not usually used, as this would entail excessive overheads in too frequently updating the index and file.
- Low volatility: indexed sequential organisation is used when the data is fairly stable.
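The file-use ratio calculation from criterion 1 can be reproduced with a short Python sketch. The function names and the cut-off values are assumptions made for illustration: the 10% lower bound follows the "say 5% to 10%" figure in these notes, while the 90% upper bound is purely hypothetical and would need tuning for a real system.

```python
LOW_RATIO = 0.10   # from the notes: a low ratio is "say 5% to 10%"
HIGH_RATIO = 0.90  # assumption: the notes give no exact figure for "high"


def file_use_ratio(records_accessed, total_records):
    """Records accessed in a period divided by the total records in the file."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return records_accessed / total_records


def suggest_organisation(ratio):
    """Map a ratio to the organisation the notes recommend for that activity level."""
    if ratio <= LOW_RATIO:
        return "direct"              # low activity, e.g. airline reservations
    if ratio >= HIGH_RATIO:
        return "sequential"          # high activity, e.g. payroll
    return "indexed sequential"      # medium activity, e.g. bank accounts


# The three weekly examples from the notes, for a file of 8,000 records.
for accessed in (250, 4100, 7600):
    ratio = file_use_ratio(accessed, 8000)
    print(accessed, ratio, suggest_organisation(ratio))
```

The ratio is only one of the four criteria, so the suggestion is a starting point rather than a decision.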

3. File size
- When files are large, serial/sequential location techniques give longer access times.
- Thus large files are usually indexed or direct files.

4. User requirements
- How does the user access the records?
  o Batch access: sequential file organisation is likely to be appropriate, providing the file activity is reasonable.
  o Interactive access: use direct access.
- The type of storage device available.
  o Magnetic tape will only allow serial/sequential access.
  o Magnetic disks allow the other file organisations to be used.
- The ease (or complexity) of actually implementing the file organisation technique with the data concerned.
- Availability, features and cost of software to handle the organisation technique preferred according to other factors.

A Note on Physical and Relative Addresses
To retrieve records we must obviously know where they are stored. There are 2 ways of indicating the location in which they are stored:

1. Physical Addresses
- A physical address tells us the actual physical location of the record on the storage medium, e.g. on a magnetic disk we would need to know the cylinder, track and sector which held the record.

Physical Address
[Figure: a disk cylinder]
Records are written onto the disk starting with track 1 on surface 1, then track 1 on surface 2, then track 1 on surface 3.
To expand, records could be stored as follows:

Cylinder   Surface   Sector   Record
1          1         9        128936A
2          7         1        117237X
3          4         8        456233C
1          4         9        980763A

2. Relative Addresses
- Modern file organisation techniques usually use relative addressing.
- The address is given according to the record's position in the file and not its physical location on the storage device.
- Thus the 56th record in a file would have a logical address of 56, quite independent of its physical location.
- Relative addresses must be converted to physical ones at some point for the computer to find the record.

Relative Address   Record
1                  68-768
2                  68-888
3                  97-023
4                  98-222

File Content
Files will have very different contents according to the work that they are created to assist. The number and type of users may also have an effect. We will briefly discuss 4 general possibilities:

1. Private one-user files
- These are created to be used by one operator or one body.
- They often hold data for just one job.

2. Private database files
- They store data for a group of related users.
  o Example: managers in an organisation.
- Several programs may well operate on the same database file(s).
  o Example: a student file may be used to produce student identity cards, update course/exam results and produce mail shots.

3. Public files
- These are also called shared files.
- They are created in order that users of a common computing service can all access each other's files either in parts or in their entirety, as specified by the producers of the files.

4. Public database files
- These are also called databanks and are databases that are open to public enquiry.
- They usually concentrate on a particular field such as medicine, law or finance.
- Often they are not a free service but charge a subscription/registration fee and/or charge for usage.
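Returning briefly to the note on relative addresses: one common way to turn a relative address into a location is to use fixed-length records, so that record n starts at byte (n - 1) x record length. The Python sketch below assumes this scheme. The 16-byte record length and the function names are hypothetical, an in-memory stream stands in for a disk file, and the further mapping from byte offset to track and sector is left to the operating system.

```python
import io

RECORD_SIZE = 16  # assumption: every record occupies exactly 16 bytes


def write_records(stream, records):
    """Write each record padded with spaces to the fixed record length."""
    for record in records:
        data = record.encode("ascii")
        if len(data) > RECORD_SIZE:
            raise ValueError("record longer than RECORD_SIZE")
        stream.write(data.ljust(RECORD_SIZE))


def read_record(stream, relative_address):
    """Jump straight to the record at a 1-based relative address and read it."""
    stream.seek((relative_address - 1) * RECORD_SIZE)
    return stream.read(RECORD_SIZE).decode("ascii").rstrip()


data_file = io.BytesIO()  # stands in for a file on a direct access device
write_records(data_file, ["68-768", "68-888", "97-023", "98-222"])
print(read_record(data_file, 3))  # third record in the relative address table
```

Because the offset is computed rather than searched for, reading record 3 does not require reading records 1 and 2 first.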

Updating Files
The files in an information system are classified by six functions:

1. Master File
Contains permanent records that are updated by adding, deleting or editing data.

2. Transaction File
Contains records of changes, additions and deletions to be made to a master file. These may be summarised before storage in the master file. A key field is selected for sorting records in a transaction file before updating the master file.

3. Table File (Lookup File)
Contains a table of static data, e.g. tax rates, that is referenced by one of the other types of files.

4. Report File
Contains information that has been prepared by the user for display or spooling to a printer, e.g. the output of the maintenance run of a Pascal program.

5. Control File
A small file containing file handling records.

6. History File
Backup files from past runs.

Batch Processing
In batch processing, data is stored during working hours and then copied to a secondary storage medium such as a magnetic tape or server during the evening or whenever the computer is idle. Batch processing usually requires the use of the computer or a peripheral device for an extended period of time. Once the batch job begins, it continues until it is done or until an error occurs.
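The way a sorted transaction file is applied to a master file in a batch run can be sketched in Python. This is a simplified illustration, not a definitive implementation: in-memory lists stand in for the files, the function name update_master, the action names "add", "edit" and "delete" and the amended surname are hypothetical, and it assumes at most one transaction per key, with both inputs already sorted on the key field.

```python
def update_master(master, transactions):
    """Apply key-sorted transactions to a key-sorted master in one sequential pass.

    master: list of (key, data) tuples sorted by key.
    transactions: list of (key, action, data) tuples sorted by key.
    Returns a new master list; the old master is left unchanged (history file).
    """
    new_master = []
    i = 0
    for key, action, data in transactions:
        # Copy across master records that come before this transaction's key.
        while i < len(master) and master[i][0] < key:
            new_master.append(master[i])
            i += 1
        exists = i < len(master) and master[i][0] == key
        if action == "add" and not exists:
            new_master.append((key, data))
        elif action == "edit" and exists:
            new_master.append((key, data))
            i += 1
        elif action == "delete" and exists:
            i += 1  # skip the old record so it is not copied
        else:
            raise ValueError(f"cannot {action} record {key}")
    new_master.extend(master[i:])  # copy the untouched remainder
    return new_master


master = [("001", "George"), ("002", "Hugh"), ("003", "Adams"), ("005", "Sinclair")]
transactions = [
    ("002", "delete", None),
    ("004", "add", "Murray"),
    ("005", "edit", "Sinclair (amended)"),
]
for record in update_master(master, transactions):
    print(record)
```

Sorting the transactions first is what allows both files to be read once from start to finish, which suits tape as well as disk.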