Unity 1.0 Troubleshooting Guide

Troubleshooting Utilities

MediaNet Analyzer

Version 1.0 runs only on the Macintosh and does not ship with Unity 1.0. It is available on the Knowledge Center website. It analyzes the UnityClientLogs.txt file that is created on the server: it reads the file and interprets the log. Depending on the commands you type into the Analyzer, it can give you various graphs and charts, as well as recommendations about what your problems are or which disks should be replaced. Because version 1.0 runs on the Mac and the UnityClientLogs.txt file is on the NT server, you must bring the file over to the Mac for analysis. In some cases the file is too large for a floppy disk, so you must zip it.

FSTest

FSTest is a utility that runs on the Macintosh. It's a performance tool used mostly by Avid Engineering and does not ship with the Unity product. It tells you whether the disks in your Disk Set are performing to expectations. Although it has functions such as sequential write tests, these tests will not blow away media on an existing Disk Set. Because of all the other utilities and logs available for troubleshooting, FSTest should be used only at the request of an engineer.

UnityStatistics

UnityStatistics is the application that creates the AvidUnityStatLogs. When you run it, it takes all the I/O and Stats information stored in RAM and writes it to 2 different files located in the AvidUnityStatLogs folder. You can view these 2 files in a text-based application to see if you have problems.

StorEx

StorEx is still a very useful tool for reporting the status of drives. StorEx will run on Unity disks and report SCSI errors. Upon initial install of a Unity system, it is a good idea to run a Sequential Read test on all of the drives. This verifies every sector on each disk. Because of the number of disks involved, this may take several hours.
If you're running StorEx because you're troubleshooting a suspected disk problem, try running the default Random Read test first for approximately 20-30 minutes. This could save you time in the long run. However, if you still suspect a disk problem and the Random Read test does not show any drive failures, it's possible that the test did not hit the bad sectors of the drive, so a Sequential Read test should follow.

Switch Software

The Insite Manager software has several items worth looking at when troubleshooting I/O problems or uninitialized loops. Obviously you want to make sure the Port Zoning is set up correctly, but in the Port Link Statistics you want to look at 2 particular columns.

The Invalid Tx Word column indicates data transmission errors. Before looking at this number, it's important to note that an inactive but connected loop will cause it to increment, and so will every reboot of a system. However, if you have an active loop and the count for its port keeps incrementing, this could indicate a problem on that port of the switch. If you want to clear this column so that you can monitor the number starting at 0, you have to power down the switch. There is no way to clear this number in the current version of the LoopInsite Manager software.

The Sync Loss column indicates that you had so many data transmission errors that they caused a sync loss or reset connection event. Basically, the number here counts reinitializations of the loop on that port. This is not good. If only one port is reporting Sync Loss, this could indicate an internal problem with the switch or port, a bad cable, or a bad FC card. However, when there was a problem with a noisy FC card on the Unity server, it caused a lot of Invalid Tx Word events, which then triggered a Sync Loss on all of the clients. When all clients report Sync Loss, this usually indicates a more global problem, such as a problem with the server or switch.

Logs

Administration.log

This log exists on each client and gives you a history of everything that has happened in the Administration tool since the beginning of the install.
Note: The log only reports the administrative functions that have been performed from the particular client the log file is on. In other words, if administrative functions have been performed from other clients, those clients will all have different logs. For troubleshooting purposes it's best to perform all administrative functions from 1 client. If a customer is having a problem with some administrative function in the Administration tool, you could have them send or fax you a copy of the log to see where the process is failing for them.

UnityClientLogs.txt
This log is created automatically on the server and is located in the Unity folder. It is created only when the server detects I/O errors on the clients, and it logs only errors. It is updated from day to day as errors occur. Every time a disk reports an error, such as a retry or medium error, it is logged here. The log reports the Disk #, SCSI Error, Date & Time, etc. It also lists partially protected files. The log can be viewed directly but is best interpreted by MediaNet Analyzer, a program that consolidates the data in the log into a more readable format.

NT Event Viewer Log

This log is built into the NT Operating System. It does not contain detailed information about the problems you may be having, but it is a good place to look if you're having server problems.

AvidUnityStat Logs

On each client system, there is a folder titled AvidUnityStatLogs. This folder contains 2 different types of logs: an I/O log and a Stat log. These logs are created only after you have run the UnityStatistics application, which is located in the AvidUnity Utilities folder on the client. The information for these logs is stored in RAM on the Mac until you create the logs. For example, if you were to reboot and then immediately create these logs, there would be no information in them.

The Stat log gives you information about the Unity software version and tells you whether you've had Disk I/O errors. It also tells you the longest time a read took, which could indicate a disk that is starting to fail. It would be very unusual for any read or write to take over 1000 ms. While this log provides a host of other information, most of it is of interest only to the developers.

The I/O log gives you a line of information for every single SCSI I/O that has occurred on the system. It also tells you what media files were being read.
To view this log properly you should use Vantage, because the log has many columns. The log may not be all that useful, given all the other tools available for troubleshooting bad disk drives.

Debug Logger

The UnityDebugLogger runs on the Macintosh. After launching the application, you can type LM and press Return to have the system log every single call the Mac OS makes to a disk as well as every message sent to the server. If you then type D and press Return, it creates a log file called thedump.txt. This log is not very useful for CS to troubleshoot problems, but may be useful for developers.

Troubleshooting
General troubleshooting guidelines

When troubleshooting Unity, you should ask yourself these questions:

- Is the problem local or global?
- Does it seem to be hardware or software related?
- Is it a known issue as reported in the Release Notes?
- Are you following the directions step by step?
- Are you doing something that is outside the range of what the product was designed to do?
- Is the system configured properly?

1. What does the Disk Error Analysis Needed indicator (the DEAN light) mean in the File Manager Status tab of the Avid Unity Monitor Tool? What should I do if it comes on?

It can mean that you have a real disk problem, such as a medium error, or it could mean that you have had switch errors, or errors transmitted from the client, that are affecting the way files are being written to the disks. The Disk Error Analysis Needed message appears on the server after the server detects I/O errors from the clients; the server then adds entries to the UnityClientLogs.txt file on behalf of the client. The server also queries the file system Metadata for partially protected files and lights the DEAN light if it detects any.

The light could appear if the client was able to write one copy of the file to the Workspace or Disk Set but was unable to write every block of the mirrored copy. In this case the client operates normally and plays back the original file, because only the mirrored copy is affected, but the incomplete mirror is flagged by the DEAN light in the Avid Unity Monitor Tool.

If you have any partially protected files, or files that could not be mirrored, they show up under affected files in the File Status tab. If this is the case, the Reset Event button will NOT be able to clear the DEAN light. The light stays on until you Optimize (reprotect) the files, or until you perform a Disk Repair and then an Optimize.
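The behaviour of the DEAN light described above can be summarized in a small model. This is only an illustrative sketch in Python, not the server's actual logic: the function names `dean_light_on` and `light_after_reset_event` are hypothetical, and the assumption that Reset Event clears the light when no partially protected files remain is inferred from the text rather than stated outright.

```python
def dean_light_on(io_errors_logged: bool, partially_protected_files: int) -> bool:
    """Hypothetical model: the light is lit if the server has logged client
    I/O errors or has found partially protected files in the Metadata."""
    return io_errors_logged or partially_protected_files > 0


def light_after_reset_event(partially_protected_files: int) -> bool:
    """Hypothetical model of pressing Reset Event.

    Assumption: Reset Event acknowledges the logged I/O error event, so only
    partially protected files can keep the light on afterwards.
    """
    return dean_light_on(False, partially_protected_files)


if __name__ == "__main__":
    # I/O errors were logged, but every file is fully protected.
    print(light_after_reset_event(0))
    # Three files are still partially protected: an Optimize is needed first.
    print(light_after_reset_event(3))
```

The point of the model is that Reset Event deals only with the logged error event; partially protected files have to be fixed by an Optimize before the light can go out.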
Optimizing the workspace:

The Optimize feature in the Administration Tool tries to reprotect (remirror) any files that, for whatever reason, did not get protected. It tries to make a copy of the original, unsuccessfully mirrored file and, if the copy succeeds, deletes the original. If the Affected Files column in the server Monitor Tool shows a value greater than 0, you have some partially protected files on your system and you should Optimize your workspaces. Again, the DEAN light turns on in the File Manager Status tab if this is the case. At this point you should Optimize your workspaces, then go back to the File Manager and hit the Reset Event button to see if the DEAN light turns off. Last, you should check the Affected Files column again and make sure there are no values greater than 0. If there are still unprotected files, the Optimize could not be completed, possibly because of a problem writing to the disk. If this is the case, do the following:

- Run the MediaNet Analyzer program (see the MediaNet Analyzer User's Guide).
- If it indicates that you have a bad disk, perform a Repair Drive in the Administration Tool.

Repair Drive:

The Repair Drive function in the Administration Tool takes the bad disk out of the Data Disk set and replaces it with a Spare disk. Before this actually happens, the software attempts to copy all valid data from the bad disk to the Spare disk so as to retrieve as much data as possible. This takes a while, because the system copies the data block for block to the Spare disk. This is a great feature: if not all of your workspaces were mirrored, and the failing disk had only a few bad blocks or an intermittent problem, you may be able to retrieve most if not all of the data that was on that disk, which results in minimal data/media loss after the disk repair.

Assuming that your problem was a bad disk and you've already performed the repair, you'll now need to Optimize your workspaces again to make the DEAN light go away. Remember, after performing the Optimize, hit the Reset Event button and check the Affected Files column to make sure there are no unprotected files.

Repair Drive procedure:

- Unmount all Workspaces from the clients.
- Take the drive set offline in the Administration Tool.
- Make a raw drive a spare if you don't have a spare yet.
- Highlight the Bad Drive and the Spare and click the Repair Drive button.

The Repair could take 30 minutes or so. When the repair is done, the Spare drive should be renamed to the name of the failed drive, and the failed drive should be placed in the Raw group.
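The order of operations in this section (Optimize, Reset Event, recheck Affected Files, MediaNet Analyzer, Repair Drive, Optimize again) can be sketched as a small decision helper. This is a hedged Python illustration only: the `Step` enum, the `next_step` function, and its parameters are hypothetical names invented for this sketch, and the inputs are values an operator would read off the tools by hand.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    RESET_EVENT = "Press Reset Event and confirm the DEAN light is off"
    OPTIMIZE = "Optimize the workspaces, then recheck Affected Files"
    RUN_ANALYZER = "Run MediaNet Analyzer on UnityClientLogs.txt"
    REPAIR_DRIVE = "Perform Repair Drive, then Optimize again"
    NOT_COVERED = "Not covered by this section"


def next_step(affected_files: int,
              optimized_since_last_repair: bool,
              analyzer_reports_bad_disk: Optional[bool] = None) -> Step:
    """Suggest the next action, following the order given in the text.

    affected_files: value of the Affected Files column.
    optimized_since_last_repair: True if an Optimize has been run since the
        last Repair Drive (or since the light came on).
    analyzer_reports_bad_disk: None until MediaNet Analyzer has been run.
    """
    if affected_files == 0:
        return Step.RESET_EVENT
    if not optimized_since_last_repair:
        return Step.OPTIMIZE
    # An Optimize was attempted but files are still unprotected.
    if analyzer_reports_bad_disk is None:
        return Step.RUN_ANALYZER
    if analyzer_reports_bad_disk:
        return Step.REPAIR_DRIVE
    return Step.NOT_COVERED


if __name__ == "__main__":
    print(next_step(4, False).value)
    print(next_step(4, True).value)
    print(next_step(4, True, True).value)
```

After a Repair Drive, the caller would pass `optimized_since_last_repair=False` again, which yields the second Optimize that the text calls for. The case where the Analyzer does not report a bad disk is left as `NOT_COVERED` because this section does not say what to do next.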
Pulling the Bad Drive from the system:

Because the File Manager constantly accesses the Admin Disk and saves Metadata, the access lights on the drives in the JBOD flash randomly, which makes it difficult to identify which drive in the JBOD is the bad one to pull. To make this easier, stop the File Manager, highlight the now Raw bad drive, and click the Identify button. The disk should flash for 3 seconds. Once you've physically located the drive, shut down (turn off) the JBOD/MediaArray enclosure, pull the bad drive, and replace it if you have a replacement. Then turn the JBOD back on and wait for all drives to spin up. Restart the File Manager.