------------------------------------------------------------------------ Comp.os.research: Welcome to comp.os.research [l/m 6 Jan 1996] -------------------------------------------------------------------------------- From: bos@serpentine.com (Bryan O'Sullivan) Newsgroups: comp.os.research Subject: Comp.os.research: Welcome to comp.os.research [l/m 6 Jan 1996] Date: 15 Nov 1997 09:00:30 GMT Message-ID: <64jobe$4a1@darkstar.ucsc.edu> Reply-To: os-faq@cse.ucsc.edu Summary: a first introduction to the operating systems research group Archive-name: os-research/welcome Version: $Revision: 1.5 $ Last-Modified: Sat Jan 6 18:16:47 1996 This periodic posting serves as an introduction to newsgroup comp.os.research. A set of companion postings, the Frequently Answered Questions for comp.os.research, summarises some of the past discussions in the group. Comp.os.research is a moderated Usenet newsgroup created to foster the discussion of and exchange of information on topics in the field of operating systems research. The moderator is Darrell Long of the University of California at Santa Cruz. The maintainer of the FAQ is Bryan O'Sullivan . As far as moderation policy is concerned, the following two principles are applied to each article submitted for posting: is it interesting? is it related to research? Job announcements will be posted if they are related to operating systems research; blatantly commercial postings are not acceptable. Calls for papers for most conferences and journals related to OS research will be posted. This is fairly broadly interpreted, since OS research is a large field. Smart alec comments are generally filed unceremoniously in /dev/null. Finally, discussions are cut off when the signal-to-noise ratio falls below a certain (fairly subjective) level. Please consider setting the Followup-To: line whenever discussion drifts away from operating systems research. This posting, like much of Usenet, is maintained on a purely volunteer basis. It is subject to comment and improvement by sending email to . Please note that there is a discaimer and copyright notice at the end of Part 2 of the FAQ. ------------------------------------------------------------------------ Comp.os.research: Frequently answered questions [1/3: l/m 13 Aug 1996] -------------------------------------------------------------------------------- From: bos@serpentine.com (Bryan O'Sullivan) Newsgroups: comp.os.research Subject: Comp.os.research: Frequently answered questions [1/3: l/m 13 Aug 1996] Date: 1 Nov 1997 10:00:12 GMT Message-ID: <63eujc$iq@darkstar.ucsc.edu> Reply-To: os-faq@cse.ucsc.edu Summary: frequent topics of discussion on the operating systems research group Archive-name: os-research/part1 Version: $Revision: 1.19 $ Posting-Frequency: monthly Last-Modified: Tue Aug 13 21:03:39 1996 URL: http://www.serpentine.com/~bos/os-faq/ Answers to frequently asked questions for comp.os.research: part 1 of 3 Copyright (C) 1993--1996 Bryan O'Sullivan TABLE OF CONTENTS 1. Introduction 1.1. How to read this article 1.2. Reader contributions and comments 1.3. Acknowledgments and caveats 2. Recurrent discussions 2.1. Microkernels, macrokernels, and the in-betweenies 2.2. Threads 2.2.1. Distinguishing features 2.2.2. Characterising implementations of multithreading 2.2.3. The history of threads 3. File systems 3.1. Extent-based versus log-structured file systems 4. Mobile and disconnected computing 4.1. Constraints on software 4.2. Communications protocols 4.3. Access to files 4.4. Power management 4.5. 
Other issues 4.6. An introductory mobile computing bibliography 5. Operating systems teaching 5.1. What good undergraduate-level texts are available? 5.2. Graduate-level texts 5.3. Do any texts cover the implementation of specific operating systems? 5.4. What instructional operating systems can I use? 5.5. Where can I find the canonical list of OS papers for grad courses? -------------------------------------------------------------------------------- Subject: [1] Introduction From: Introduction This posting consists of answers to many of the questions most frequently asked and summaries of the topics most frequently covered on comp.os.research, the Usenet operating systems research discussion group. The purpose of this posting is to circulate existing information, and to avoid rehashing old topics of discussion and questions. Please read all parts of this document before posting to this newsgroup. This newsgroup is moderated; the moderator is Darrell Long . A companion posting to the FAQs, `Welcome to comp.os.research', briefly covers the moderation policy and guidelines for posting to comp.os.research. It can be found in either comp.os.research or news.answers, and is posted regularly. Due to its size, the FAQ is split up into three parts; each is posted once a month. The welcome posting is posted fortnightly. The FAQ is also available in hypertext form on the World-Wide Web, at . You may prefer to browse the FAQ on the Web rather than on Usenet, as it contains many useful hyperlinks. Note: chunks of text of the form [92-02-12-21-20.29] indicate the original posting from which a section of this article was inspired, snarfed, or just plain copied wholesale. The FAQ as available on the Web has hyperlinks to the relevant articles. Other chunks in square brackets are comments and reminders to myself. These latter sections of text will be removed as appropriate material is added, but the attributions will remain. -------------------------------------------------------------------------------- Subject: [1.1] How to read this article From: Introduction This article is posted in digest format; using the `G%' command from within the `nn' newsreader should split it up into separate sub-articles which you can browse through. To skip to a particular question numbered n.m, use `/: \[n\.m\]' from most pagers. From within GNU Emacs, you can use `C-s [n.m]'. This article is treated as an outline when edited by GNU Emacs. -------------------------------------------------------------------------------- Subject: [1.2] Reader contributions and comments From: Introduction Your contributions, comments, and corrections are welcomed; mail sent to will be dealt with as quickly as I can manage. Generally, performing a reply or followup to this article from within your newsreader should do the Right Thing. While I am more than happy to include submissions of material for the FAQ if they seem appropriate, it would make my life a lot easier if such text were proof-read in advance, and kept concise. I don't have as much time as I would like to digest 15K text files and summarise them in three paragraphs for inclusion here. If you are interested in contributing material, please see the to-do list at the end of part 3 of the FAQ. -------------------------------------------------------------------------------- Subject: [1.3] Acknowledgments and caveats From: Introduction Although this FAQ has been the result of a co-operative effort, any blame for inaccuracies and errors lies entirely with my edits. 
I would like to thank the following people for their part in contributing to this article: Arindam Banerji, Surendar Chandra, Steve Chapin, Crispin Cowan, Dan Hildebrand, Gordon Irlam, Alan Judge, Darrell Long, Chris Maeda, Peter Magnusson, Craig Partridge, Tom Van Vleck, and Robert Walsh.

--------------------------------------------------------------------------------

Subject: [2] Recurrent discussions
From: Recurrent discussions

A number of topics tend to appear with regularity in comp.os.research. This section goes over some of the most commonly covered ground; the list of topics is by no means exhaustive.

--------------------------------------------------------------------------------

Subject: [2.1] Microkernels, macrokernels, and the in-betweenies
From: Recurrent discussions

A recurrent topic of discussion in this newsgroup has been the comparison between microkernel (for example Mach and QNX) and `macrokernel' (traditional Unix) operating systems. The basic notion of a microkernel is to devolve as much functionality as possible into processes rather than the kernel itself; different systems take different approaches to implementing this. For example, some systems (such as Mach) leave device drivers in the kernel and place higher-level services (such as file systems) outside; others (such as QNX) move device drivers out of the kernel as well.

However, anecdotal evidence [93-03-03-07-56.52] suggests that the distinction between microkernel and monolithic architectures is becoming more blurred as time goes on, as the two advance. For example, most modern monolithic kernels now implement multiple threads of execution and fine-grained parallelism. Architecturally, this approach begins to resemble a microkernel with several kernel-space processes working from shared memory.

As an aside, people often complain that the Mach system can't be a `real' microkernel because it is so large (at least, this is the argument most frequently cited). However, I have been told that automatically-generated code stubs contribute very significantly to the size of the kernel, and that some size reduction would be likely if MIG (the stub generator) produced better code. [Can someone from CMU comment on this?] As mentioned above, keeping device drivers in the kernel also contributes to Mach's size.

Debating microkernels versus monolithic kernels on the basis of kernel size misses the central, architectural point. Just as the point of a RISC processor is not to minimise the instruction count, but to make a different tradeoff between what is implemented in the processor instruction set and what is implemented in other ways, the microkernel question is which services are implemented inside the microkernel and which are implemented outside it. By making appropriate choices here, the goal is to enhance various OS attributes (such as performance, flexibility, and realtime behaviour) in a manner that might not be achievable with a monolithic kernel.

Some history: Ira Goldstein and Paul Dale coined the term `microkernel' around 1989.

--------------------------------------------------------------------------------

Subject: [2.2] Threads
From: Recurrent discussions

The exact meaning of the term `thread' is not generally agreed upon.
One of the more common usages denotes a `lightweight' process (sequential flow of control) which shares an address space and some other resources with others, and for which context switching time is lower than for `heavyweight' (i.e. kernel-supported) processes. Throughout the following material, when we refer to `processes', this can be taken as meaning heavyweight processes. -------------------------------------------------------------------------------- Subject: [2.2.1] Distinguishing features From: Recurrent discussions Some of the features which distinguish different approaches to threading are listed below: - Number of *concurrent* flows of control: generally, threads may potentially make use of multiple processors in order to allow several to execute concurrently. That is, the model usually takes into consideration the possibility that there may be more than one flow of control active at any time. - Scheduling policy: a thread scheduler may be pre-emptive, in which case a thread is put to sleep either when it waits upon some resource or runs for the full duration of its time quantum, or non-pre-emptive, in which case individual threads continue to run until they relinquish the processor themselves (either through waiting on a resource or calling the analogue of a sleep() function). Systems which are non-pre-emptive and may only ever have a single active flow of control (regardless of the number of processors available) are referred to as coroutine systems. Coroutine programming requires quite a different approach from threads-based programming, as many of the synchronisation and resource-sharing problems which occur in threaded environments need never trouble the coroutines programmer. -------------------------------------------------------------------------------- Subject: [2.2.2] Characterising implementations of multithreading From: Recurrent discussions An important distinction may be made between user-level threads and kernel-supported threads. A user-level thread maintains all its state in user space. A consequence of this is that no kernel resources need to be allocated per thread, and switching between threads can be done without changing address space. A disadvantage is that user level threads cannot execute while the kernel is busy, for instance, with paging or I/O. This would require some knowledge and participation on the part of the kernel. It is possible to combine both methods, as is done in SunOS 5.x (aka Solaris 2.x). Here, one or more light weight processes (LWPs) multitask one or more user-level threads, which in turn are implemented using user-space libraries. Some issues which characterise thread implementations, and which determine the uses to which a threads package may be put, include: - How much by way of kernel resources does a thread require? This will typically limit the number of threads that can be started by a process. - Under what circumstances will the entire process hang? For instance, if some thread gets a page fault, may another thread in that process be dispatched? - Does switching threads require a full system call (as on the SPARC), or may context switches between threads be performed entirely at user level? - How are signals handled? Can signals be masked individually per thread? Is there a `broadcast' signal? - How are stacks handled? Will the stacks shrink/grow dynamically on a per thread basis? Several systems today (QNX and Plan 9, for instance) take the stance that threads `fix the symptom, but not the problem'. 
Rather than using threads because the OS context switch time is too slow, a better approach, according to the architects of these systems, is to fix the OS. It is ironic that, now that even PC-hosted desktop OSes provide MMU-protected multitasking, the fashionable programming model has become multiple threads running in a common address space, which makes debugging harder and reliable code more difficult to write. With fast context switching, existing OS services such as explicitly allocated shared memory between a team of cooperating processes can create a `threaded' environment, without opening the Pandora's box of problems that a fully shared memory space entails.

--------------------------------------------------------------------------------

Subject: [2.2.3] The history of threads
From: Recurrent discussions

[93-04-21-13-32.11] [92-01-27-17-05.54]

The notion of a thread, as a sequential flow of control, dates back at least to 1965, with the Berkeley Timesharing System. Only they weren't called threads at that time, but processes [Dijkstra, 65]. Processes interacted through shared variables, semaphores, and similar means. Max Smith did a prototype threads implementation on Multics around 1970; it used multiple stacks in a single heavyweight process to support background compilations.

Perhaps the most important progenitor of threads is the programming language PL/I, from about the 1965 time frame. The language as defined by IBM provided a `CALL XXX (A, B) TASK;' construct, which forked a thread for XXX. It is not clear whether any IBM compiler ever implemented this feature, but it was examined closely while Multics was being designed; it was decided that the TASK call as defined didn't map onto processes, since there was no protection between the threads of control. So Multics took a different direction, and the TASK feature was removed from PL/I by IBM in any case, along with the ABNORMAL attribute and lots of other weird stuff.

Then came Unix, in the early 1970s. The Unix notion of a `process' became a sequential thread of control *plus* a virtual address space (incidentally, the Unix notion of a process derived directly from the Multics process design [Saltzer, 66]). So `processes', in the Unix sense, are quite heavyweight machines. Since they cannot share memory (each has its own address space), they interact through pipes, signals, etc. Shared memory (also a rather ponderous mechanism) was added much later.

After some time, Unix users started to miss the old processes that could share memory. This led to the `invention' of threads: old-style processes that shared the address space of a single Unix process. They were also called `lightweight', by way of contrast with `heavyweight' Unix processes. This distinction dates back to the very late 70s or early 80s, i.e. to the first `microkernels' (Thoth (precursor of the V-kernel and QNX), Amoeba, Chorus, the RIG-Accent-Mach family, etc). On a side note, threads have been in continuous use in telecommunications applications for quite a long time.

See also:

[Cheriton, 79] Cheriton, D. R., `Multi-process structuring and the Thoth operating system', Ph.D. Thesis, University of Waterloo, 1979.

[Daley & Dennis, 68] Daley, R. C. & Dennis, J. B., `Virtual memory, processes, and sharing in Multics', Comm. ACM 11, 306-312, May 1968.

[Dennis & van Horn, 66] Dennis, J. B. & van Horn, E. C., `Programming semantics for multiprogrammed computations', MAC-TR-21, 1966.

[Dijkstra, 65] Dijkstra, E. W., `Cooperating sequential processes', in `Programming Languages', Genuys, F. (ed.), Academic Press, 1965.

[Saltzer, 66] Saltzer, J. H., `Traffic control in a multiplexed computer system', MAC-TR-30 (Sc.D. Thesis), July 1966.

--------------------------------------------------------------------------------

Subject: [3] File systems
From: File systems

This field is discussed both here and in the comp.arch.storage newsgroup. This section needs fleshing out at the moment; it will grow over time (hopefully!).

--------------------------------------------------------------------------------

Subject: [3.1] Extent-based versus log-structured file systems
From: File systems

[92-11-18-10-57.53] [92-11-22-10-16.26]

A general definition for a log-structured storage system might be the following: logging provides append-only storage semantics. The unit of logging is the record. Write accesses append records to the end of the log; other write accesses are forbidden. A log record may become obsolete, and useless records are compacted out of the log when possible.

An extent-based file system is another candidate for better filesystem performance. The approach used under QNX, for example, is to have files exist as an array of extents on disk, where each extent is as many contiguous blocks as could be allocated at that location. By using a bitmap for space allocation, files can also grow `in place' if adjacent free space exists on disk. This approach allows the unit of disk space allocation to remain 512 bytes, which is also the smallest unit of physical I/O. The potential performance bottleneck of this approach is avoided because the filesystem code passes I/O requests to the disk drivers in units of extents, not 512-byte blocks. The filesystem also heuristically modifies the size of pre-read requests based on the historical access pattern to the file. This approach provides the performance advantages of larger physical disk block sizes, without the wasted disk space that results from unused portions of large blocks, or the complexity of trying to allocate space from partially unused blocks.

--------------------------------------------------------------------------------

Subject: [4] Mobile and disconnected computing
From: Mobile and disconnected computing

The subject of operating systems for mobile and frequently-disconnected computers has become a recurrent topic in this newsgroup. This section attempts to give an overview of issues in this area. [Text by Arindam Banerji.]

--------------------------------------------------------------------------------

Subject: [4.1] Constraints on software
From: Mobile and disconnected computing

System software for mobile computing is impeded by five distinct constraints:

- Compared to stationary computers, mobile computers will always be resource poor [Satyanarayanan, 93]. Although currently available PDAs (Personal Digital Assistants) compare favourably with the stand-alone workstations of a few years ago [Marsh, 93], they'll most probably lag behind in compute capabilities, available power, storage availability and communication bandwidth for some time to come.

- Mobility entails computation amid fluctuating resource availability and constraints [Banerji, 93]. Communication bandwidth may be available only at discrete intervals, an available resource may suddenly become unreachable, or an otherwise inexpensive communication link may be replaced by an expensive alternative while in transit.
- Security threats, both to mobile computational elements and to the data they access, are greatly increased [Satyanarayanan, 93]. Not only is it easier to lose, damage or be robbed of a carry-along PDA, but it is often easier to tap into the data transferred (as is well known to much of the cellular communication industry). Very little work, except for that undertaken by the cellular communication industry, has been done in the area of addressing the specific security needs of mobile computing (as far as I know).

- User needs and their application requirements may not be the same as those in stationary systems [Weiser, 91]. As mobile computers become ubiquitous (a phrase coined by Mark Weiser), the number of computer users will most probably increase exponentially. Many of these users will be far less computer literate than the average computer user of today. In addition, shopping, information browsing and entertainment may be the typical uses of such mobile units, as opposed to traditional scientific computing, database support or word processing.

- With the presence of PCMCIA slots in a PDA, it also becomes necessary for an OS to be able to mount and dismount entire OS subsystems on the fly [Hildebrand, 94]. Operating systems need to be able to treat networking, filesystems, and other services as facilities which may be loaded and unloaded on demand.

Based upon an amalgam of these criteria, the next few sections discuss some of the main areas of ongoing research in mobile computing.

--------------------------------------------------------------------------------

Subject: [4.2] Communications protocols
From: Mobile and disconnected computing

Mobile-IP [Myles & Perkins, 93] `allows packets between mobile hosts or networks and other hosts (including fixed hosts) to be delivered along close to optimal routes'. Compatibility with existing IP implementations is one of the key problems in Mobile-IP. For example, [Perkins et al., 93] have suggested a scheme based upon the loose source routing option of IP packets, but most existing IP implementations do not implement this option. Scalability is yet another important issue. The Columbia scheme [Ioannidis et al., 91] uses IP-in-IP encapsulation, thus avoiding problems with non-conforming implementations, but it achieves only sub-optimal routing for mobility across widely distributed locations [Aziz, 94]. Some efficient implementations of IP-in-IP encapsulation capable of supporting near-optimal wide-area mobile routing have been suggested [Aziz, 94], but more experimentation is required.

For resource-constrained mobile computers, hosting a full IP protocol suite may be an unacceptable resource burden. Being able to gateway with a lightweight protocol to a network node which is hosting a `heavyweight' protocol suite is a valuable capability [Hildebrand, 94]. Lightweight protocols can also make better use of the limited bandwidth of wireless links. Apart from this, architectures and implementations that handle the impact of mobility at higher layers have also been proposed, such as the connection-oriented services discussed by Katz [Keeton et al., 93] and the mobile socket interface discussed by Casey [Casey, 93]. Current trends would appear to suggest that some form of Mobile-IP will soon become standard, whereas connection maintenance and caching in higher-level protocols still need to be resolved.
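To make the IP-in-IP encapsulation mentioned above concrete, here is a minimal sketch in C of what a home agent's tunnelling step might look like: the original datagram is wrapped in a new outer IP header whose destination is the mobile host's current care-of address, and whose protocol field marks the payload as an encapsulated IP datagram. The structure layout and the encapsulate() helper are simplified illustrations invented for this FAQ, not code from any of the schemes cited above; a real implementation would use network byte order, compute the header checksum, and deal with options, fragmentation and MTU issues.

  /*
   * Sketch of IP-in-IP encapsulation for Mobile-IP style tunnelling.
   * The home agent prepends an outer IPv4 header addressed to the
   * mobile host's care-of address; the node at that address strips
   * the outer header and processes the original datagram as usual.
   */
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define PROTO_IPIP 4                /* IANA protocol number for IP-in-IP */

  struct ip_hdr {                     /* simplified IPv4 header, no options */
      uint8_t  ver_ihl;               /* version (4) and header length (5 words) */
      uint8_t  tos;
      uint16_t total_len;             /* outer header plus encapsulated datagram */
      uint16_t id;
      uint16_t frag_off;
      uint8_t  ttl;
      uint8_t  protocol;              /* PROTO_IPIP in the outer header */
      uint16_t checksum;              /* left zero here; computed before sending */
      uint32_t src;                   /* home agent's address */
      uint32_t dst;                   /* mobile host's care-of address */
  };

  /* Wrap an already-formed datagram in an outer header; returns the new
   * length, or 0 if the caller's buffer is too small.  Fields are shown
   * in host byte order for clarity. */
  size_t encapsulate(uint8_t *buf, size_t buflen,
                     const uint8_t *inner, size_t inner_len,
                     uint32_t home_agent, uint32_t care_of)
  {
      struct ip_hdr outer;

      if (buflen < sizeof outer + inner_len)
          return 0;

      memset(&outer, 0, sizeof outer);
      outer.ver_ihl   = 0x45;         /* IPv4, 20-byte header */
      outer.ttl       = 64;
      outer.protocol  = PROTO_IPIP;
      outer.total_len = (uint16_t)(sizeof outer + inner_len);
      outer.src       = home_agent;
      outer.dst       = care_of;

      memcpy(buf, &outer, sizeof outer);             /* outer header first */
      memcpy(buf + sizeof outer, inner, inner_len);  /* then the original datagram */
      return sizeof outer + inner_len;
  }

The sub-optimal (`triangle') routing mentioned above follows directly from this picture: every packet destined for the mobile host must first travel to the home agent to be encapsulated, however close the sender may be to the mobile host's current location.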
--------------------------------------------------------------------------------

Subject: [4.3] Access to files
From: Mobile and disconnected computing

File access in a mobile computing environment, where the communication link to a file server is not guaranteed, has been a major area of study. Coda [Satyanarayanan, 90], a descendant of the Andrew File System (AFS), pioneered support for disconnected operation in file systems. Coda increases file availability by replicating a single volume at multiple server locations. Disconnected operation occurs when the set of accessible servers for a particular volume becomes null. Coda supports disconnected operation by pre-caching the files a user is most likely to need and then allowing all operations on the cached copies of these files while disconnected. Upon reconnection, reintegration occurs through reconciliation of the cached copy with the now-reachable server's copy, using a replay log maintained during the disconnection. Disconnected operation has also been implemented for AFS [Huston, 93]. The highly available, peer-to-peer based Ficus file system [Page, 91] achieves similar results, although mobile computing was not one of its initial applications.

Caching issues are beginning to dominate the open research topics in this area. Between the connected and disconnected states there are many states of expensive, intermittent and unreliable connectivity, and adapting caching to these varying situations is a necessity. More importantly, as introduced by Coda's hoarding scheme, user control over some caching behaviour is extremely beneficial, and this need for user input becomes even more important when the server connection is weak.

--------------------------------------------------------------------------------

Subject: [4.4] Power management
From: Mobile and disconnected computing

Current battery technology limits PDA use to only a few hours. The conservation of power through system software is thus becoming a major area of research in mobile computing. Two specific approaches to this problem exist.

- Some researchers [Greenawalt, 93] are attempting to analyse the effects of application type, user input and operating system implementations on device power consumption. Based upon simulation data, several power consumption models have been proposed for disks [Greenawalt, 93] [Douglis & Marsh, 93]. Work in characterising and analysing the power consumption problem is still ongoing.

- Several industry-led efforts, on the other hand, have focussed on building system support for dynamic power-saving mechanisms. The Advanced Power Management standard presents an interface and structure for manipulating power consumption. The Nomadic Systems Group at Sun Microsystems has integrated similar support into SVR4 [Bender et al., 93]; these facilities are also available in QNX. Complete analysis of peripheral device power usage and experimentation with different strategies for implementing power management will certainly be useful.

--------------------------------------------------------------------------------

Subject: [4.5] Other issues
From: Mobile and disconnected computing

Two significant aspects of mobile computing give applications in this environment a very different flavour.

- The dynamic nature of the environment forces applications to handle changes in the availability and allocation of software resources.
Support for dynamic changes to environment variables [Schilit, 93], for changes in the available version of a library [Goldstein, 94], and for looking up and retrieving objects from remote locations [Theimer, 93] is required to solve this problem. For the very same reasons, user interfaces take on an extra dimension, an issue which very few have addressed so far [Landay & Kaufmann, 93]. All this has caused certain vendors to move towards interpreted environments based on scripting languages such as Script-X (Kaleida) and the Open Scripting Architecture (Apple).

- Money will be a constituent of many of the transactions and applications that mobile computers will typically be used for. Hence, many pieces of system software will be required to handle, understand and optimise the use of money [Kulkarni, 94]. As mentioned by Ed Frank at the ICDCS '93 panel discussion on mobile computing, transactions involving `money and sex' may well become the biggest uses of the mobile computer. Some initial forays into reviewing policies for pricing Internet services [Shenker, 93] may prove to be very useful, and so will the experience of current consumer service providers such as CompuServe and America Online. This area will perhaps show the biggest divergence in the years to come, since applications will be far more customer-needs driven than technology-driven, as they have been in the past.

Finally, aspects of hardware support are critical to positioning any discussion on mobile computing. The most ambitious system is perhaps the ParcTab system [Schilit, 93] under development at Xerox PARC. The ParcTab is a PDA that communicates via infrared data packets with a network of infrared transceivers. The network, designed for use within a building, designates each room as a communication cell. This infrastructure has the responsibility for providing reliable service as the ParcTab user moves from room to room. More general-purpose but less ambitious PDAs are currently available from AT&T (EO), Apple (Newton), IBM (Simon), etc. Almost all recognise some alternative form of input, such as handwriting. The capabilities of these PDAs are sure to increase in the coming years, and hopefully their prices will not follow a similar trend.

--------------------------------------------------------------------------------

Subject: [4.6] An introductory mobile computing bibliography
From: Mobile and disconnected computing

[Marsh, 93] Marsh, B., Douglis, F. & Caceres, R., `System issues in mobile computing', Technical Report, Matsushita Information Technology Laboratory, ITL-TR-50-93

[Satyanarayanan, 93] Satyanarayanan, M. et al., `Experience with disconnected operation in a mobile computing environment', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 11-28

[Banerji, 93] Banerji, A., Cohn, D. & Kulkarni, D., `Mobile computing personae', Proceedings of Fourth Workshop on Workstation Operating Systems, October 93, pp. 14-20

[Weiser, 91] Weiser, M., `The computer for the 21st century', Scientific American, September 91, pp. 94-104

[Myles & Perkins, 93] Myles, A. & Perkins, C., Internet Draft, September 93

[Perkins et al., 93] Bhagwat, P. & Perkins, C., `A mobile networking system based on IP', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 69-82

[Ioannidis et al., 91] Ioannidis, J. et al., `IP-based protocols for mobile internetworking', Proceedings of the SIGCOMM-91 conference: Communications, Architectures and Protocols, September 91, pp. 235-245

[Aziz, 94] Aziz, A., `A scalable and efficient intra-domain tunneling mobile-IP scheme', ACM SIGCOMM Computer Communication Review, Vol. 24, No. 1, January 94, pp. 12-20

[Keeton et al., 93] Keeton, K. et al., `Providing connection-oriented network services to mobile hosts', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 83-102

[Casey, 94] Casey, M., `Application and communication support for mobile computing', Dissertation Proposal, University of Notre Dame, January 94

[Satyanarayanan, 90] Satyanarayanan, M. et al., `Coda: a highly available file system for a distributed workstation environment', IEEE Transactions on Computers 39(4), April 90

[Huston, 93] Huston, L. & Honeyman, P., `Disconnected operation for AFS', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 1-10

[Page, 91] Page, T. et al., `Architecture of the Ficus replicated file system', Technical Report CSD-910005, University of California, Los Angeles, March 91

[Greenawalt, 93] Greenawalt, P., `Modelling power management for hard disks', Intl. Workshop on Modelling, Analysis & Simulation of Computer and Telecommunication Systems, January 94

[Douglis & Marsh, 93] Douglis, F. & Marsh, B., `Low power disk management for mobile computers', Technical Report, Matsushita Information Technology Laboratory, MITL-TR-53-93

[Bender et al., 93] Nomadic Systems Group, Sun Microsystems, `UNIX for Nomads: Making UNIX support mobile computing', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 53-68

[Schilit, 93] Schilit, B., Theimer, M. & Welch, B., `Customizing mobile applications', Proceedings of Mobile and Location-Independent Computing Symposium, August 93, pp. 129-138

[Hildebrand, 94] Hildebrand, D., `QNX: microkernel technology for open systems handheld computing', Proceedings of Pen and Portable Computing Conference Exposition, May 1994. Available via ftp from .

[Goldstein, 94] Goldstein, T. & Sloane, A., `The object binary interface -- C++ objects for evolvable shared class libraries', Proceedings of the USENIX C++ Conference, April 94

[Theimer, 93] Theimer, M., Demers, A. & Welch, B., `Operating system issues for PDAs', Proceedings of Fourth Workshop on Workstation Operating Systems, October 93, pp. 14-20

[Landay & Kaufmann, 93] Landay, J. & Kaufmann, T., `User-interface issues in mobile computing', Proceedings of Fourth Workshop on Workstation Operating Systems, October 93, pp. 40-47

[Kulkarni, 94] Kulkarni, D., Banerji, A. & Cohn, D., `Operating systems and cost management', Operating Systems Review, January 94

[Shenker, 93] Shenker, S., `Service models and pricing policies for an integrated services Internet', Proceedings of the Conference on Public Access to the Internet, JFK School of Government, Harvard University, May 93

[Schilit, 93] Schilit, B. et al., `The ParcTab mobile computing system', Proceedings of Fourth Workshop on Workstation Operating Systems, October 93, pp. 34-39

--------------------------------------------------------------------------------

Subject: [5] Operating systems teaching
From: Operating systems teaching

This section attempts to give some useful pointers to people who teach operating systems, both at undergraduate and postgraduate level.

--------------------------------------------------------------------------------

Subject: [5.1] What good undergraduate-level texts are available?
From: Operating systems teaching The comments below have been provided by a variety of people, so any `me's or `I's you encounter are not necessarily those of the maintainer! - `Operating Systems Concepts', fourth edition, by Abraham Silberschatz and Peter Galvin is the latest version of this popular text. Addison-Wesley, 1994, ISBN 0-201-50480. This book has been revised to include new and updated information, examples, diagrams, and an expanded bibliography. I think this is the `standard' OS text, although I have a couple of others that I also think are good, and that I draw from when I teach OS. Previous editions of the dinosaur book don't have the greatest organisation, and sometimes wander when describing things. Its strong point lies in the copious examples. Speaking of the third edition (I haven't seen a copy of the fourth edition yet): The first 84 pages cover operating system basics, the next 120 pages cover process management including 30 pages on deadlocks. 130 pages on storage management: memory, virtual memory, secondary storage. 70 pages on file systems and protection. Then 100 pages on distributed systems. The last part of the book has case studies on Unix and Mach: 50 pages on Unix and 30 pages on Mach. The last chapter gives a short 10 page historical perspective. Mail a message with contents `send help' to for further details of the new edition. The book gives a good (but slightly theoretical) overview of operating system concepts. A good complement would be the books covering Minix or BSD, which are more implementation-oriented. - `Operating Systems', Harvey Deitel, Addison-Wesley, 1990, ISBN 0-201-18038-3. Not a bad book; gives the same sort of theoretical treatment of operating systems as the dinosaur book. Includes case studies on Unix, MS DOS, MVS, VM, the Macintosh OS, and OS/2. - `An Operating Systems Vade Mecum', second edition, by Raphael Finkel, 1988, Prentice Hall, ISBN 0-13-637950-8. I really like this book; it is a bit more theoretical than the dinosaur book, but is well-written and clear. I would accompany it with labs based on one of the educational experimental OS's (NachOS, OSP) for hands-on experience. The edition mentioned above is now out of print. However, it may be obtained via anonymous ftp from . Here is the associated chunk of README: This textbook is out of print. It was published by Prentice Hall. The author now owns the copyright. Permission is granted to copy this text for any noncommercial purpose. Feel free to generate copies of the text for your students. You may also photocopy the original book without restriction. Kindly send suggested upgrades to the author: . He is planning a new edition sometime. [It's been a few years since I've looked at this book, so I can't remember what it contains. Can anyone help?] - `The Logical Design of Operating Systems', second edition, Lubomir Bic, Alan Shaw, 1988, Prentice Hall, ISBN 0-13-540139-9. This one isn't as theoretical as Finkel's book, nor is it as long as the dinosaur book. I haven't tried to use it in a course yet, but it looks like a fairly well-rounded text. [Can anyone write a paragraph on the various topics covered ... ?] - `Operating Systems', second edition, William Stallings , Prentice-Hall, 1995, ISBN 0-02-415493-8. I received very positive feedback from students about the first edition of this book; I have not yet seen the second edition. The explanations of topics were easy to understand and complete. 
An especially nice feature was that at the end of each chapter OS/2, Unix and MVS were used to demonstrate real life implementations of the theory talked about. I found this tying together of theory and practice much nicer than having the practice lumped at the end of the book. - `Modern Operating Systems,' Andrew Tanenbaum , 1992, Prentice Hall, ISBN 0-13-588187-0. This started out as a rewrite of the Minix book, but he pulled the Minix-specific material and added seven chapters on distributed systems. It's a bit heavy for undergrads, depending on how far into the distributed systems you go, but I like Tanenbaum as an author. He'll be bringing out a second edition of the Minix book sometime soon; as he says, one is for `hands-on' (Minix) and one is for `hands-off' (Modern OS). The book is divided into two parts: `traditional' introductory material, taken more or less verbatim from the Minix book, and an introduction to distributed systems. Each parts concludes with a case study and comparison of two well-known systems (Unix and MS-DOS, and Mach and Amoeba). The bibliography at the end is organised well for more advanced coverage of the topics encountered throughout the book. Topics covered in the first part include process concepts, memory management, file system organisation and I/O, and deadlock detection and avoidance. The second part addresses issues such as distributed communication, synchronisation (the section on clock synchronisation is well put together), processes in distributed environments (nothing on process migration), and distributed file systems (using AFS as an example). The second part seems more suitable for advanced undergraduate level or introductory graduate level studies. This book has been translated into German; it is available from Carl Hanser Verlag as `Moderne Betriebssysteme', ISBN 3-446-17472-9. - `Operating System Design: the Xinu Approach', Douglas Comer, Timothy Fossum, 1984, Prentice Hall, ISBNs 0-13-638180-4 (PC edition) and 0-13-638529-X (Macintosh edition). A walk-through of the principles behind, and implementation of, the Xinu operating system, a small instructional OS similar to Unix. While this text is aging somewhat, it presents its material in a clear fashion, and does a good job of covering the "standard" fundamentals of operating systems. - `Operating Systems: Design and Implementation', Andrew S. Tanenbaum, 1986 (?), Prentice Hall, ISBN 0-13-637406-9. This, along with Comer's Xinu books, is the classic text which `teaches by doing', covering the design and implementation of Minix, a microkernel operating system which has a programming and user interface similar to Unix. As with Comer's books, this text is showing its age somewhat (the source is very much out of date with the current Minix distribution), but it still does a good job of presenting the basics of operating system implementation. - `Operating Systems Programming: The SR Programming Language', Stephen J. Hartley , Oxford University Press, 1995, ISBN 0-19-5095790. SR is a language for concurrent programming; this book presents the language, presents some example programs in the context of operating systems or concurrent programming, and provides exercises in the form of Open Student Laboratories. The book is designed to be used in conjunction with one of the standard operating systems texts to provide concurrent programming experience, or can be used alone as an introductory concurrent programming book. I have not seen a copy of it yet, and so cannot comment on its quality. 
The example programs in the book are intended for running in a Unix environment; they are available via anonymous ftp from , and the SR language itself is available from . -------------------------------------------------------------------------------- Subject: [5.2] Graduate-level texts From: Operating systems teaching This section is still under construction. - `Distributed Systems', second edition, by Sape Mullender, Addison-Wesley, 1994, ISBN 0-201-62427-3. A review is forthcoming. - `Distributed Operating Systems -- the Logical Design', Andrzej Goscinski, Addison-Wesley, 1991, ISBN 0-201-41704-9. A thorough desk reference, but reads a little too much like an encyclopedia for use as a textbook. - `Modern Operating Systems,' Andrew Tanenbaum , 1992, Prentice Hall, ISBN 0-13-588187-0. The section of this book which covers distributed systems is suitable for use at introductory graduate level. See above for further details. - `Concurrent Systems', Jean Bacon, 1992, Addison-Wesley, ISBN 0-201-41677-8. This covers much the same material as `Modern Operating Systems', but goes into rather more detail on databases and languages. The book is divided into four parts, and comes with a separate instructor's manual (ISBN 0-201-62406-0). The first covers basic material, such as OS functions, and system and language support for concurrent processes. Part 2 deals with simple concurrent actions, covering topics such as shared-memory IPC, message passing, persistent data, crashes, and distributed data. The third part of the book covers transactions, concurrency control, and failure recovery. The final section presents a set of case studies, with Unix, Mach and Chorus being covered, along with some of the work done at Cambridge over the past decade. An interesting emphasis is placed on language-level support for concurrency throughout the book, and the focus on database issues is also a good thing. I haven't read the book in as much detail as I would like, but it seems to be well put together. The cramming of so many topics under one cover means that there is probably too much material for a single undergraduate course, and the book perforce does not go into as much detail as I would like on some topics (a problem I also find with Tanenbaum's book). Well worth a look, however. - `Distributed Systems: Concepts and Design', second edition, George Coulouris , Jean Dollimore, and Tim Kindberg, Addison-Wesley 1994, ISBN 0-201-62433-8. This text treats a wide variety of issues at a level suitable for advanced undergraduate and postgraduate teaching. Basic topics covered include IPC, networking and RPC, upon which notions of distributed operation and provision of services are built. Coverage of distributed synchronisation leads on to a treatment of replication, simple transactions and concurrency control. The final chapters include material on distributed transactions, fault tolerance, security, and distributed shared memory. Illustrative examples taken from modern `real world' systems such as Sun RPC, the Andrew File System, and PGP are provided throughout the book, and case studies of the Amoeba, Mach, Chorus, and Clouds systems appear towards the end. Exercises are presented at the end of each chapter. The prose is clear, and the layout pleasant. This is, by a narrow margin, the best distributed systems textbook I have come across. - `Advanced Concepts in Operating Systems -- Distributed, Multiprocessor, and Database Operating Systems', Mukesh Singhal, Niranjan G. 
Shivaratri, McGraw-Hill, 1994, ISBN 0-07-057572-X. A solid work on advanced operating systems, with some emphasis on theoretical aspects. Well over 2/3 of the book focuses on distributed operating systems. It does a good job of covering all the bases, but at times omits vital information or obfuscates what should be simple issues.

--------------------------------------------------------------------------------

Subject: [5.3] Do any texts cover the implementation of specific operating systems?
From: Operating systems teaching

Some books commonly used in conjunction with the texts listed above are:

- `The Design and Implementation of the 4.3BSD Unix Operating System', Samuel Leffler, Kirk McKusick, Michael Karels, John Quarterman, 1989, Addison-Wesley, ISBN 0-201-06196-1. This book describes the kernel of 4.3BSD Unix in some detail, covering process and memory management (the latter being heavily VAX-oriented), file system organisation, device driver internals, terminal handling, IPC, network communications, some details of the Internet protocols, and system operation. I found this book to be well-written and concise. Accompanying the above is the `4.3BSD Answer Book', Samuel Leffler, Kirk McKusick, 1991, Addison-Wesley, ISBN 0-201-54629-9. This short book provides answers to all of the exercises found at the end of each chapter in the daemon book.

- `The Design of the Unix Operating System', Maurice Bach, 1986, Prentice Hall, ISBN 0-13-201757-1. This is the authoritative description of the internals of System V Release 2 Unix. Due to copyright restrictions, it contains no actual code, but rather pseudo-code (I didn't find this to be a problem). Topics covered include file system internals, process control and scheduling, memory management, IPC, and device driver architecture. Coverage of multiprocessor and distributed Unix systems is dated, but this remains a classic operating systems text.

- `The Magic Garden Explained: The Internals of Unix System V Release 4', Berny Goodheart, James Cox, 1994, Prentice Hall, ISBN 0-13-098138-9. This book covers the workings of SVR4 in substantial detail; it is more detailed than either of the above texts, and appears to be of very high quality. While the authors recommend that the book be read in parallel with study of the original Unix source code, enough pseudocode representation of the key algorithms has been included to permit a fairly detailed study without needing access to the original (restricted) source code.

- `Unix Internals: The New Frontiers', Uresh Vahalia, 1995, Prentice Hall, ISBN 0-13-101908-2. This is quite simply a wonderful book. The broad issues it covers include threads and lightweight processes, and how they interact; signal implementations, process group and session management; process scheduling; IPC; kernel synchronisation and multiprocessor architectures; local and distributed filesystems; kernel memory management; and device driver architectures. Each topic is accompanied by details of its implementation under modern Unix variants such as Solaris 2.x, SVR4.2, and 4.4BSD, and its treatment by the Mach kernel. The writing style is concise and pleasant, and the treatment of each topic is satisfyingly thorough and clear. If you are interested in the implementation of Unix or other operating systems, this book is a "must have".

- `Unix Systems for Modern Architectures: Caching and Symmetric Multiprocessing for Kernel Programmers', Curt Schimmel, 1995, Addison-Wesley, ISBN 0-201-63338-8.
Covers in extensive detail the operation of caches and symmetric multiprocessors, how they interact, and the issues operating systems must address in order to make effective use of them. Issues addressed include the management of virtually- and physically-tagged caches on uniprocessors, synchronisation and mutual exclusion techniques for multiprocessors, standard multiprocessor kernel architectures, and multiprocessor cache coherency. This book is written in a clear manner, and illustrated effectively. Each chapter ends with lists of exercises and references. My copy contains a number of typographical errors, but I am told that later printings have addressed this issue. I am not aware of any similar book which covers the implementation of a distributed system. -------------------------------------------------------------------------------- Subject: [5.4] What instructional operating systems can I use? From: Operating systems teaching - Minix, from Amsterdam's Vrije Universiteit, was developed by Andy Tanenbaum , and is a mostly-Unix lookalike based on a message-passing microkernel-similar architecture. The system is used in Tanenbaum's `Modern Operating Systems' and its predecessor, `Operating Systems: Design and Implementation'. See the Minix Information Sheet, posted regularly to comp.os.minix, for ordering information; Minix is copyrighted and is not in the public domain, but is available from . For further information, see Andy's Web page at . - NachOS is an instructional OS developed at Berkeley by Tom Anderson , Wayne Christopher, Stephen Procter (and others?). It currently runs on DEC MIPS and Sun SPARC workstations, HP PA-RISC, and 386BSD machines. The NachOS system, along with sample assignments and an overview paper which was presented at Usenix, is available via anonymous ftp from . - OSP (current version is 1.7) is an operating systems simulator, available via ftp from , with username ospftp, and password as in the instructor's guide. Used in `OSP---an Environment for Operating Systems', Michael Kifer, Scott Smolka, Addison-Wesley. - RCOS (Ron Chernich's Operating System) is a simulated operating system that is intended to demonstrate graphically the concepts behind operating systems. Students can investigate and modify the algorithms it uses, and write programs in a Pascal-like language (extended with semaphores and shared memory) which it will execute. RCOS has a windowing interface, and currently runs under MS-DOS; an alpha-quality Unix/X11 port is also available. For further details, check out the Web page at . - SunOS Minix consists of a set of patches for Minix which allows the Minix system to run in a single monolithic Unix process on top of SunOS 4.x on Sun 3 and Sun 4 machines. SunOS Minix is produced by applying a set of patches to Mac Minix 1.5 (both 1.5.10.0 and 1.5.10.1 can be used) or PC Minix 1.5. Also, Atari Minix has been used as the base version by at least one person. The latest version (2.0) includes a preliminary attempt at a port to Solaris 2.x. SunOS Minix is available via anonymous ftp from . - VSTa is not intended as an instructional operating system, but it is certainly small and concise enough to be tractable, and the code is clean and provides modern microkernel features. See part 2 of the FAQ for further details. - Xinu was developed at Purdue by Doug Comer and some others. It was designed to be small and layered, so that the code is succinct and easily understandable. It is intended for education, and is a `conventional' operating system. 
Xinu runs on the IBM PC, Sun-3, SPARC, LSI, MIPS, Macintosh, and VAX architectures. The system is used in Comer's `Operating System Design: the Xinu Approach'. See for licensing information. -------------------------------------------------------------------------------- Subject: [5.5] Where can I find the canonical list of OS papers for grad courses? From: Operating systems teaching [93-03-14-17-09.47] Darrell Long maintains a bibliography which provides a good starting point for graduate OS course reading lists. This may be imported using refdbms as ucsc.grad.os, from refdbms.cse.ucsc.edu 4117 or refdbms.cs.vu.nl 4117. ------------------------------------------------------------------------ Comp.os.research: Frequently answered questions [2/3: l/m 13 Aug 1996] -------------------------------------------------------------------------------- From: bos@serpentine.com (Bryan O'Sullivan) Newsgroups: comp.os.research Subject: Comp.os.research: Frequently answered questions [2/3: l/m 13 Aug 1996] Date: 1 Nov 1997 10:00:25 GMT Message-ID: <63eujp$iu@darkstar.ucsc.edu> Reply-To: os-faq@cse.ucsc.edu Summary: frequent topics of discussion on the operating systems research group Archive-name: os-research/part2 Version: $Revision: 1.22 $ Posting-Frequency: monthly Last-Modified: Tue Aug 13 21:03:28 1996 URL: http://www.serpentine.com/~bos/os-faq/ Answers to frequently asked questions for comp.os.research: part 2 of 3 Copyright (C) 1993--1996 Bryan O'Sullivan TABLE OF CONTENTS 1. Available software 1.1. Where can I find Unix process checkpointing and restoration packages? 1.2. What threads packages are available for me to use? 1.3. Can I use distributed shared memory on my Unix system? 1.4. Where can I find operating systems distributions? 1.4.1. Distributed systems and microkernels 1.4.2. Unix lookalikes 1.4.3. Others 2. Performance and workload studies 2.1. TCP internetwork traffic characteristics 2.2. File system traces 2.3. Modern Unix file and block sizes 2.3.1. File sizes 2.3.2. Block sizes 2.3.3. Inode ratios 3. Papers, reports, and bibliographies 3.1. From where are papers for distributed systems available? 3.2. Where can I find other papers? 3.3. Where can I find bibliographies? 4. General Internet-accessible resources 4.1. Wide Area Information Service (WAIS) and World-Wide Web (WWW) servers 4.2. Refdbms---a distributed bibliographic database system 4.3. Willow -- the information looker-upper 4.4. Computer science bibliographies and technical reports 4.5. The comp.os.research archive 4.6. Miscellaneous resources 5. Disclaimer and copyright -------------------------------------------------------------------------------- Subject: [1] Available software From: Available software This section covers various software packages, operating systems distributions, and miscellaneous other such items which may be of interest to the operating systems research community. If you have written, or know of, some software which you believe would be of fairly wide interest, please get in touch with the FAQ maintainer with a view to having a short spiel and availability information included here. -------------------------------------------------------------------------------- Subject: [1.1] Where can I find Unix process checkpointing and restoration packages? From: Available software - [93-01-21-10-18.30] The Condor system is available via anonymous ftp from . Condor works entirely at user level [no kernel modifications required] but doesn't currently support interprocess communication, signals, or fork(). 
Definitely worth a look. - Bennet S Yee implemented a `mostly portable' checkpoint and restore package back around 1987. When the programmer invokes the checkpoint procedure, it saves the state to a file; when a second process with the same program (but with different arguments) is started which calls the restore procedure, it reads the old state from the file. Available via anonymous ftp from . This package is known to work for Pmaxen, Sun4's, Sun3's, IBM RTs, and VAXen. Porting it to a new architecture should be relatively simple -- look at the README file. -------------------------------------------------------------------------------- Subject: [1.2] What threads packages are available for me to use? From: Available software Now that POSIX has arrived at a standard threads interface, it is expected that all major Unix vendors will soon release conformant threads packages. Currently, vendor-supplied threads packages vary widely in the interfaces they provide. Some vendors' packages conform to various drafts of the POSIX standard, while others provide their own interfaces. OS/2, Windows NT and Windows 95 all provide threads interfaces. None conforms to the POSIX standard, and neither IBM nor Microsoft has signalled any intention to provide conformant threads interfaces. - Michael T. Peterson has written a POSIX and DCE threads package, called PCthreads, for Intel-based Linux systems. See for more information. - Christopher Provenzano has written a portable implementation of draft 8 of the IEEE Pthreads standard. See for further details, or fetch the software itself from . Currently supported are i386/i486/Pentium processors running NetBSD 1.0, FreeBSD 1.1, Linux 1.0, and BSDi 1.1; DECstations running Ultrix-4.2; SPARCstations running SunOS 4.1.3; and HP/PA machines running HP/UX-9.03. As far as I can see, development of this library has halted (at least temporarily), and it still contains many serious bugs. - Georgia Tech's OS group has a fairly portable user-level threads implementation of the Mach Cthreads package. It is called Cthreads, and can be found at . It also contains the Falcon integrated monitoring system. It currently runs under SunOS 4.1.X, Irix 4.0.5, Irix 5.3, AIX 3.2.5, Linux 1.0 and higher, and KSR1 and KSR2. It is a fairly easy to port to other architectures. Current ports in progress are Solaris 2.4 and AIX 4.X. - The POSIX / Ada-Runtime Project (PART) has made available an implementation of draft 6 of the POSIX 1003.4a Pthreads specification, which runs under SunOS 4.x; the current release is version 1.20. Available using anonymous ftp from . - Elan Feingold has written a threads package called ethreads; I don't know anything about it, other than that it is available from . - Stephen Crane has written a `fairly portable' threads package, which runs under Sun 3, Sun 4, MIPS/RISCos, Linux, and 386BSD. It is available via anonymous ftp from , with documentation in the same directory named lwp.ps.gz. - QuickThreads is a toolkit for building threads packages, written by David Keppel. It is available via anonymous ftp from , with an accompanying tech report at . The code as distributed includes ports for the Alpha, x86, 88000, MIPS, SPARC, VAX, and KSR1. - On CONVEX SPP Exemplar machines there is a Compiler Parallel Support Library (CPSlib), a library of thread management and synchronisation routines. CPSlib is not compatible with anything else, but the interface is sufficiently similar to the Solaris threads or pthreads interface to allow straight porting. 
One special feature of CPSlib is the (possible) distinction between "symmetric" and "asymmetric" parallelism.

A small number of vendors provide DCE threads packages for various Unix systems.
--------------------------------------------------------------------------------
Subject: [1.3] Can I use distributed shared memory on my Unix system?
From: Available software

- CRL is a simple all-software distributed shared memory system intended for use on message-passing multicomputers and distributed systems. CRL 1.0 can be compiled for use on the MIT Alewife Machine, Thinking Machines' CM-5, and networks of Sun workstations running SunOS 4.1.3 communicating with one another using TCP and PVM. Because CRL requires no functionality from the underlying hardware, compiler, or operating system beyond that necessary to send and receive messages, porting CRL to other platforms should prove to be straightforward. General information about CRL can be found at . The CRL 1.0 source distribution (sources for CRL 1.0 and several applications, user documentation, and a postscript version of a paper about CRL that is to appear at SOSP later this year) is available at .
- Ron Minnich has implemented a distributed shared memory system called MNFS, which is a modified version of NFS and runs alongside NFS in the kernel. Performance is good; page faults under FreeBSD 2.0R run at about the same speed as NFS (~5.9 milliseconds per page). If you need to update a page from one host to many clients, it can be done at a cost of 1.2 milliseconds or so per client. This scales: networks of 128 nodes running MNFS have been set up, and times should improve on LANs faster than Ethernet.
  The MNFS programming model uses mmap'ed files. Programs map files in and then use them as ordinary memory. Cache consistency of a page is maintained by the MNFS servers, ensuring that there is only one writeable copy in the network at a time. The model is not strongly coherent; read-only copies of a page are only refreshed by an explicit action on the part of the holder of a writeable page (using msync). For those who don't like this style of programming, a parallel C compiler has been retargeted to use MNFS on clusters and networks of computers running Condor. Both performance and scalability matched explicitly mmap-coded systems.
  The system has been implemented on SunOS 4.1.x, Solaris 2.2 and 2.3, IRIX 5.2 and 5.3, and AIX 3.2. All of these were legally encumbered, so the FreeBSD version is currently the only freely-available implementation. MNFS is available from , and may be installed either as a set of diffs to the FreeBSD 2.0.5R kernel, or in place. Also included in this directory is a slightly out-of-date paper on MNFS, and a more current manual. A Linux port of MNFS is in the works.
--------------------------------------------------------------------------------
Subject: [1.4] Where can I find operating systems distributions?
From: Available software

This section covers the availability of several well-known systems; the only criterion for inclusion of a system here is that it be of interest to some segment of the OS research community (commercial systems will be accepted for inclusion, so long as they are pertinent to research).
--------------------------------------------------------------------------------
Subject: [1.4.1] Distributed systems and microkernels
From: Available software

See part one of the FAQ for further information on some of the systems listed below.
- [93-03-31-22-49.53] ACE is the distribution, support and sales channel for Amoeba. `Due to overwhelming response from non-profit organisations wishing to obtain Amoeba for their research activities', VU is offering Amoeba 5.2 to research institutions more or less for free (via ftp at no charge, or on tape for $500 on Exabyte or $800 on QIC-24). Amoeba currently supports 68020- and 68030-based VME board machines, as well as i386- and i486-based AT PCs and Sun 3 and 4 machines. For further information on `commercial' Amoeba, you can contact ACE by email at , by phone at +31 20 664 6416, or by fax at +31 20 675 0389. Universities interested in obtaining a license should send mail to , or fax to +31 20 642 7705.
- Chorus Systemes has special programmes for universities interested in using Chorus. For more information on the offerings available, conditions, and other details, get the following files: - - -
- The Cronus object-oriented distributed system may be obtained via ftp from ; email for details of the account name and password. Before attempting to get the Cronus distribution, you must obtain, via anonymous ftp, . Maintenance, hotline support, and training for Cronus are available from BBN. Send email to the above address for information on these, or on obtaining a commercial license.
- Flux is a Mach-based toolkit for developing operating systems; you can find more information about it on the Web at .
- Horus is available for research use; contact Ken Birman or Robbert van Renesse for details.
- Isis has not been publicly available since 1989, but may (I'm not sure) still be obtained using anonymous ftp from or . After 1989, the code was picked up by Isis Distributed Systems, which has subsequently developed and supported it. The commercial version of Isis (available `at very low cost' to academic institutions) is available from the company. Email for information, or call +1-212-979-7729 or +1-607-272-6327.
- Information on obtaining the latest Mach 4 distribution is available from the University of Utah's Mach 4 pages, at .
- The Plan 9 distribution is now commercially available for $350; it consists of a two-volume manual, a CD-ROM with all the sources, and four PC diskettes comprising a binary-only installation of a fairly complete version of the system that runs on a PC. For more information, ; this site houses ordering information, a browsable copy of all the documentation, and the PC binary distribution. Kernels exist for the Sun SLC, Sun4Cs of various types, NeXTstations, the MIPS Magnum 3000, the SGI 4D series, the AT&T Safari, `a whole bunch of' PCs, and the Gnot. The Basser Department of Computer Science at Sydney University has a port of Plan 9 to the DEC Alpha underway at the moment. A port to the Sun 3 has been completed. Contact for details. The Plan 9 user mailing list may be subscribed to by sending mail to <9fans-request@cse.psu.edu>.
- QNX is available for academic applications through an education support programme run by QNX Software Systems, whereby QNX systems can be obtained for educational purposes at very low cost. For commercial and education availability and pricing, contact:

    QNX Software Systems
    175 Terrence Matthews Cr.
    Kanata, Ontario K2M 1W8
    Canada
    1 800 363 9001
    +1 (613) 591 0931
    +1 (613) 591 3579 (fax)

    QNX Software Systems
    Westendstr. 19
    6000 Frankfurt am Main 1
    Germany
    +49 69 9754 6156 x299
    +49 69 9754 6110 (fax)

  Versions after 4.2 of QNX run on the i386 and later processors, with a 16-bit kernel included for i286 machines.
  Native optimisations and a compiler for the Pentium are also included. Further marketing information can be obtained on the World Wide Web from .
- The 1.1 Research Distribution of the Spring distributed, object-oriented operating system is available. Spring is a highly modular, object-oriented operating system which is centred around a uniform interface definition language (IDL). The system is intrinsically distributed, with all system interfaces being accessible both locally and remotely. The 1.1 Research Distribution adds a number of fixes and improvements, including a Spring-Java IDL system that facilitates writing Java applets that can talk across Spring IDL interfaces.
  The Spring SRD 1.1 Binary CDROM is $75 to universities and $750 to commercial research institutions. This includes all of the software and documentation necessary for installing, running, and developing new system modules and applications in Spring. All binaries, IDL files, development tools, key exemplary sources, and course teaching materials are included. A standard full source license and source CDROM is also available for $100 to universities and $1000 to commercial research institutions. For more details and ordering information, see .
- [93-02-07-16-03.48] The Sprite Network Operating System is available on CD-ROM. The disc contains the source code and documentation for Sprite, a research operating system developed at the University of California, Berkeley. All the research papers from the Sprite project are also included on the disc. The software on this disc is primarily intended for research purposes, and is not really intended to be used as a production system. Boot images are provided for Sun SPARCstations and DECstations. The CD-ROM is in ISO-9660 format with Rock Ridge extensions. The disc contains about 550 megabytes of software. You can get an overview of the Sprite Project, and a complete list of what is on this disc, by anonymous ftp from .
  If you would like a CD-ROM, please send $25. Add $4.95 if you would like a caddy too. S&H is $5 (per order, not per disc) for US/Can/Mex, and $10 for overseas. If you live in California, please add sales tax. You can send a check or money order, or you can order with Mastercard/Visa/AmEx.

    Bob Bruce
    Walnut Creek CDROM
    1547 Palos Verdes Mall, Suite 260
    Walnut Creek, CA 94596
    United States
    1 800 786-9907 (USA only)
    +1 510 947-5996
    +1 510 947-1644 (fax)

- VSTa is a copylefted system written by Andrew Valencia which uses ideas from several research operating systems in its implementation. It is currently in an `experimental but usable' state, supports `lots of' POSIX, and runs on a number of different PC configurations. For further information, send mail to , or ftp to .

[Chorus, Clouds?, Choices?]
--------------------------------------------------------------------------------
Subject: [1.4.2] Unix lookalikes
From: Available software

- FreeBSD is available via ftp from , , and . The latest version is derived from 4.4BSD Lite, and contains many extensions. See for further information.
- NetBSD is available via ftp from , and is also derived from 4.4BSD Lite. See for more information.
- Linux is available via anonymous ftp from , , and . It is a freely-distributable System V compatible Unix, and is covered by the GNU General Public License. Linux runs on almost all PCs with i386 or better CPUs and at least 4 megabytes of memory. See for further details.
- 386BSD is available via ftp from or .
  It lies mid-way between 4.3BSD Reno and 4.4BSD internally, and contains no AT&T-copyrighted code. 386BSD runs on ISA bus PCs with i386 or better CPUs. Use of 386BSD is not recommended, since it is unstable and has long since been superseded by FreeBSD and NetBSD.
- The Hurd is the GNU operating system, being written by Michael Bushnell. It is based on Mach 3.0, and should be available on most systems to which Mach has been ported. A preliminary runnable image may be fetched from . Trent A. Fisher runs an unofficial Hurd page at .
- Lites is a free 4.4BSD-based Unix server which runs on top of Mach. Lites provides binary compatibility with 4.4BSD, NetBSD (0.8, 0.9, and 1.0), FreeBSD (1.1.5 and 2.0), 386BSD, UX (4.3BSD), and Linux on the i386 platform. It has also been ported to the pc532 and PA-RISC. Preliminary ports to the R3000 and Alpha processors have also been made. For more information, see the Lites home page at , and see also .
--------------------------------------------------------------------------------
Subject: [1.4.3] Others
From: Available software

[93-03-18-10-19.02] Microsoft is making sources of Windows NT available under license to universities and research laboratories. You should have the appropriate officials contact to get started on this process.

Patrick Bridges' operating systems home page at is an excellent source of information on a variety of other operating systems.
--------------------------------------------------------------------------------
Subject: [2] Performance and workload studies
From: Performance and workload studies

This section covers various publicly-available traces, studies, libraries, and source distributions which may be of use.
--------------------------------------------------------------------------------
Subject: [2.1] TCP internetwork traffic characteristics
From: Performance and workload studies

- The Internet Traffic Archive is a moderated repository to support widespread access to traces of Internet network traffic. The traces can be used to study network dynamics, usage characteristics, and growth patterns, as well as providing the grist for trace-driven simulations. The archive is also open to programs for reducing raw trace data to more manageable forms, for generating synthetic traces, and for analyzing traces. The archive is available on the Web at . There you will find a description of the archive, its associated mailing lists, the moderation policy and submission guidelines, and the contents of the archive (traces and programs).
- [92-10-20-15-04.39] Peter Danzig and Sugih Jamin of USC have made available a report and a source library which simulates realistic day-to-day network traffic between nodes. The library, tcplib, `is motivated by our observation that present-day wide-area tcp/ip traffic cannot be accurately modeled with simple analytical expressions, but instead requires a combination of detailed knowledge of the end-user applications responsible for the traffic and certain measured probability distributions'. The technical report and the source library it describes are available via anonymous ftp from . All you need to transfer to use the library are: README, brkdn_dist.h, tcpapps.h, tcplib.1, and one of libtcp* that matches your setup. You need tcplib.tar.Z only if you must generate the library yourself. The file tcplibtr.ps.Z is the PostScript version of the report. The authors may be contacted at .
- [93-08-09-15-15.54] Vern Paxson of Lawrence Berkeley Laboratory has a report available via anonymous ftp which describes analytic models for wide-area TCP connections, based upon a set of wide-area traffic traces. The report may be obtained from .
- [93-05-13-10-54.09] Vern Paxson has also made available another report, , which provides an analysis of the growth trends of a medium-sized research laboratory's wide-area TCP connections over a period of more than two years.
--------------------------------------------------------------------------------
Subject: [2.2] File system traces
From: Performance and workload studies

- Randy Appleton has a set of filesystem traces which detail every operation performed during a period of more than a week (several hundred thousand events). Timestamps on the traces are accurate to under a millisecond. For more details, contact the author, or visit .
- Chris Ruemmler has done a study on low-level disk access patterns for a workstation, a server, and a time-shared system, which appeared in the Winter 1993 USENIX proceedings. A copy may be obtained via anonymous ftp from .
- Stephen Russell has instrumented the SunOS 4.1.x kernel running on Sun 3 machines. The system allows time-stamped event records to be obtained from various points in the kernel. Events can be categorised (eg, paging, file system, etc), and are read via pseudo-devices. Ioctl calls allow substreams to be enabled/disabled, buffer status checked, etc. An external high resolution timer is used for timestamping.
- [93-05-09-09-23.32] The traces used in `Measurements of a distributed file system' (SOSP 1991) may be obtained from .
--------------------------------------------------------------------------------
Subject: [2.3] Modern Unix file and block sizes
From: Performance and workload studies

The following sections are lifted more or less verbatim from a number of traces which were co-ordinated and analysed by Gordon Irlam . The numbers quoted below are based on Unix file size data for 12 million files, residing on 1000 file systems, with a total size of 250 gigabytes. Further information may be obtained on the World Wide Web at .
--------------------------------------------------------------------------------
Subject: [2.3.1] File sizes
From: Performance and workload studies

There is no such thing as an average file system. Some file systems have lots of little files. Others have a few big files. However, as a mental model the notion of an average file system is invaluable. The following table gives a breakdown of file sizes and the amount of space they consume.

   file size   #files %files %files disk space %space %space
(max. bytes)                  cumm.       (Mb)         cumm.
           0   147479    1.2    1.2       0.0    0.0    0.0
           1     3288    0.0    1.2       0.0    0.0    0.0
           2     5740    0.0    1.3       0.0    0.0    0.0
           4    10234    0.1    1.4       0.0    0.0    0.0
           8    21217    0.2    1.5       0.1    0.0    0.0
          16    67144    0.6    2.1       0.9    0.0    0.0
          32   231970    1.9    4.0       5.8    0.0    0.0
          64   282079    2.3    6.3      14.3    0.0    0.0
         128   278731    2.3    8.6      26.1    0.0    0.0
         256   512897    4.2   12.9      95.1    0.0    0.1
         512  1284617   10.6   23.5     566.7    0.2    0.3
        1024  1808526   14.9   38.4    1442.8    0.6    0.8
        2048  2397908   19.8   58.1    3554.1    1.4    2.2
        4096  1717869   14.2   72.3    4966.8    1.9    4.1
        8192  1144688    9.4   81.7    6646.6    2.6    6.7
       16384   865126    7.1   88.9   10114.5    3.9   10.6
       32768   574651    4.7   93.6   13420.4    5.2   15.8
       65536   348280    2.9   96.5   16162.6    6.2   22.0
      131072   194864    1.6   98.1   18079.7    7.0   29.0
      262144   112967    0.9   99.0   21055.8    8.1   37.1
      524288    58644    0.5   99.5   21523.9    8.3   45.4
     1048576    32286    0.3   99.8   23652.5    9.1   54.5
     2097152    16140    0.1   99.9   23230.4    9.0   63.5
     4194304     7221    0.1  100.0   20850.3    8.0   71.5
     8388608     2475    0.0  100.0   14042.0    5.4   77.0
    16777216      991    0.0  100.0   11378.8    4.4   81.3
    33554432      479    0.0  100.0   11456.1    4.4   85.8
    67108864      258    0.0  100.0   12555.9    4.8   90.6
   134217728       61    0.0  100.0    5633.3    2.2   92.8
   268435456       29    0.0  100.0    5649.2    2.2   95.0
   536870912       12    0.0  100.0    4419.1    1.7   96.7
  1073741824        7    0.0  100.0    5004.5    1.9   98.6
  2147483647        3    0.0  100.0    3620.8    1.4  100.0

A number of observations can be made:
- the distribution is heavily skewed towards small files
- but it has a very long tail
- the average file size is 22k
- pick a file at random: it is probably smaller than 2k
- pick a byte at random: it is probably in a file larger than 512k
- 89% of files take up 11% of the disk space
- 11% of files take up 89% of the disk space

Such a heavily skewed distribution of file sizes suggests that, if one were to design a file system from scratch, it might make sense to employ radically different strategies for small and large files. The seductive power of mathematics allows us to treat a 200-byte file and a 2MB file in the same way. But do we really want to? Are there any problems in engineering where the same techniques would be used in handling physical objects that span 6 orders of magnitude? A quote from sci.physics that has stuck with me: `When things change by 2 orders of magnitude, you are actually dealing with fundamentally different problems'.

People I trust say they would have expected the tail of the above distribution to have been even longer. There are at least some files in the 1-2G range. They point out that DBMS shops with really large files might have been less inclined to respond to a survey like this than some other sites. This would bias the disk space figures, but it would have no appreciable effect on file counts. The results gathered would still be valuable, because many static disk layout issues are determined by the distribution of small files and are largely independent of the potential existence of massive files. (It should be noted that many popular DBMSs, such as Oracle, Sybase, and Informix, use raw disk partitions instead of Unix file systems for storing data, hence the difficulty in gathering data about them in a uniform way.)
--------------------------------------------------------------------------------
Subject: [2.3.2] Block sizes
From: Performance and workload studies

The last block of a file is normally only partially occupied, and so as block sizes are increased, the amount of wasted disk space increases too.
The following historical values for the design of the BSD FFS are given in `Design and implementation of the 4.3BSD Unix operating system':

  fragment size   overhead
  (bytes)         (%)
  512             4.2
  1024            9.1
  2048            19.7
  4096            42.9

Files have clearly gotten larger since then; I obtained the following results:

  fragment size   overhead
  (bytes)         (%)
  128             0.3
  256             0.6
  512             1.1
  1024            2.5
  2048            5.4
  4096            12.3
  8192            27.8
  16384           61.2

By default the BSD FFS typically uses a 1k fragment size. Perhaps this size is no longer optimal, and should be increased. (The FFS block size is constrained to be no more than 8 times the fragment size. Clustering is a good way to improve throughput for FFS-based file systems, but it doesn't do very much to reduce the not insignificant FFS computational overhead.)

It is interesting to note that even though most files are less than 2K in size, having a 2K block size wastes very little space, because disk space consumption is so totally dominated by large files.
--------------------------------------------------------------------------------
Subject: [2.3.3] Inode ratios
From: Performance and workload studies

The BSD FFS statically allocates inodes. By default, one inode is allocated for every 2K of disk space. Since an inode consumes 128 bytes, this means that by default 6.25% of disk space is consumed by inodes. It is important not to run out of inodes, since any remaining disk space is then effectively wasted. Despite this, allocating 1 inode for every 2K is excessive.

For each file system studied, I worked out the minimum sized disk it could be placed on. Most disks needed to be only marginally larger than the size of their files, but a few disks, having much smaller files than average, needed a much larger disk---a small disk had insufficient inodes.

  bytes per   overhead
  inode       (%)
  1024        12.5
  2048        6.3
  3072        4.5
  4096        4.2
  5120        4.4
  6144        4.9
  7168        5.5
  8192        6.3
  9216        7.2
  10240       8.3
  11264       9.5
  12288       10.9
  13312       12.7
  14336       14.6
  15360       16.7
  16384       19.1
  17408       21.7
  18432       24.4
  19456       27.4
  20480       30.5

Clearly, the current default of one inode for every 2K of data is too small. Earlier results suggested that allocating one inode for every 5-6k was in some sense optimal, and that allocating one inode for every 8k would only be 0.4% worse. The new data suggests one inode for every 4k is optimal, and that allocating one inode for every 8k would be 2.1% worse.

The analysis technique I used is very sensitive to even a few file systems with very small files. The main source of file systems with lots of small files would appear to be netnews servers; the typical Usenet message would appear to be 1-2k in length. Ignoring such file systems would drastically alter the conclusions I reach. If, as I believe might already be the case, news servers are manually tuned to have a lower than normal bytes-per-inode ratio, it would then be possible to justify setting the default ratio much higher.

Clearly it is best if the file system dynamically allocates inodes; I believe AIX does this, for instance. Systems that statically allocate inodes should probably increase the bytes-per-inode ratio, but it is not clear to exactly what value. The engineer in me says `it is important to play this one conservatively: stick to 6k'; the artist goes `as Chris Torek says: aesthetics, 8k'.
--------------------------------------------------------------------------------
Subject: [3] Papers, reports, and bibliographies
From: Papers, reports, and bibliographies

Network-available documents are listed in this section.
I'd like to see information for obtaining other sets of reports which aren't electronically-available included here as well, at some stage. -------------------------------------------------------------------------------- Subject: [3.1] From where are papers for distributed systems available? From: Papers, reports, and bibliographies Amoeba Arjuna Choices Chorus Clouds Cronus Mungi ExOS Flexmach Fox Guide Horus Isis Mach Nebula PEACE Plan 9 RTmach Spring SUNMOS / Puma Tigger X kernel / Scout Papers covering Amoeba, Choices, Chorus, Clouds, the Hurd, Guide, Mach, Mars, NonStop, and Plan 9 are also available via anonymous ftp from . [I'd like to find the authoritative home for V---Mars and NonStop are a bit more obscure, I think; they certainly aren't asked after much] -------------------------------------------------------------------------------- Subject: [3.2] Where can I find other papers? From: Papers, reports, and bibliographies Angel Apertos Cache kernel Hive Mungi KeyKOS Pegasus QNX [93-09-19-22-22.26] Solaris 2.x [93-02-23-12-12.43] SPIN Synthetix VSTa Windows NT [92-09-18-11-46.16] -------------------------------------------------------------------------------- Subject: [3.3] Where can I find bibliographies? From: Papers, reports, and bibliographies Distributed shared memory Load balancing Mobile computing Multimedia operating systems [94-04-15-23-29.51] Object-oriented operating systems Parallel and distributed I/O Sprite network operating system See also the section on General Net Resources. [There's quite a lot more at , if anyone wants to add more to this list.] -------------------------------------------------------------------------------- Subject: [4] General Internet-accessible resources From: General Internet-accessible resources This section contains information about a variety of services available to the OS research community via the Internet. -------------------------------------------------------------------------------- Subject: [4.1] Wide Area Information Service (WAIS) and World-Wide Web (WWW) servers From: General Internet-accessible resources [92-09-21-16-38.23] Loughborough University high-performance networking and distributed systems archive may be accessed via the World Wide Web at . This archive contains, according to Jon Knight , the organiser: - Technical reports and papers written at LUT by the networks and distributed systems researchers in the Department of Computer Studies. - Technical reports, papers and theses which have been produced at other sites and then made available for public electronic access. - Software which is of use in research or which has been produced by a specific research project. - Details of relevant conferences, collected from a variety of sources (USENET, email, flyers, etc). - Information on ongoing research projects. - Bibliographies that have been generated for research at LUT and also access to other WAIS indexed bibliographies, both at LUT and elsewhere. - A list of contacts in the field, with details of their research interests. This is entirely voluntary (i.e. people have agreed to Jon entering their details rather than him just rooting round the Internet to build up the information). Bibliographies in the comp.os.research collection are accessible via WAIS from UCSC. 
(:source :version 3 :ip-address "128.114.134.19" :ip-name "ftp.cse.ucsc.edu" :tcp-port 210 :database-name "os-bibliographies" :cost 0.00 :cost-unit :free :maintainer "paul@cse.ucsc.edu" :description "Server created with WAIS release 8 b5 on Jul 9 22:38:27 1992 by paul@cse.ucsc.edu The files of type bibtex used in the index were: /home/ftp/pub/bib" ) -------------------------------------------------------------------------------- Subject: [4.2] Refdbms---a distributed bibliographic database system From: General Internet-accessible resources [92-10-01-11-39.32] The 13th alpha release of refdbms version 3, developed by John Wilkes of the Concurrent Systems Project at Hewlett-Packard Laboratories and Richard Golding of the Concurrent Systems Laboratory at UC Santa Cruz, is now available. It can be obtained by anonymous ftp from . The system has been tested on Sun 3 and 4 systems running SunOS 4.1.x, and on DECstations running Ultrix 4.1. It is an experiment in building weak-consistency wide-area distributed applications, and the databases currently available for the system have a good systems coverage. The system includes tools to query the database, to produce bibliographies for LaTeX documents, and to enter new references into the database. It is part of ongoing research into wide-area distributed information systems on the Internet. Features include: - Distributed databases: a reference database can be shared among multiple sites. Updates can be entered at any site, and will be propagated to the other sites holding a replica of the database. - Multiple databases: every database has a name, and users specify the order in which databases will be searched. - Private databases: databases can be private, available site-wide, or they can be made available to other sites. - Database query by keyword, author, and title word. - Translator for refer-format databases. - Usable with LaTeX documents: the internal refdbms format can be translated into a special BibTeX format. An up-to-date list of bibliographies exported by various institutions may be obtained using anonymous ftp from . -------------------------------------------------------------------------------- Subject: [4.3] Willow -- the information looker-upper From: General Internet-accessible resources The University of Washington's Willow system provides a Motif-based user interface to a heterogeneous collection of on-line bibliographic databases. It will compile and run on most systems which provide a Motif library. For further information, see the Willow home page at . -------------------------------------------------------------------------------- Subject: [4.4] Computer science bibliographies and technical reports From: General Internet-accessible resources - A collection of bibliographies in various fields of computer science is available via anonymous ftp and the World Wide Web. The bibliographies contain about 260,000 references, most of which are references to journal articles, conference papers or technical reports. The collection has been formed by using various freely accessible services in the Internet (anonymous ftp, mailserver, wais, telnet) and converting each bibliography into a uniform BibTeX format. It is organised in files containing references to a (more or less) specific area within computer science. The database has been organised by Alf-Christian Achilles . It may be accessed on the Web at , via ftp from , and through a more useful search mechanism on the Web at . 
- As part of the ARPA Electronic Library Project, the Database Group at Stanford is providing a Selective Dissemination of Information (SDI) service to disseminate information about computer science technical reports. You can have a server email you periodic announcements of new papers on topics that interest you. See for details, or contact Tak Yan or the mail server itself at . -------------------------------------------------------------------------------- Subject: [4.5] The comp.os.research archive From: General Internet-accessible resources [93-02-18-21-18.31] An archive of all messages posted to comp.os.research since 1988 is maintained at UC Santa Cruz. It may be accessed via anonymous ftp at . The archive is organised by year. Postings may also be found via WAIS at UCSC's Computer Science gopher hole: (:source :version 3 :ip-address "128.114.134.19" :ip-name "ftp.cse.ucsc.edu" :tcp-port 210 :database-name "comp-os-research" :cost 0.00 :cost-unit :free :maintainer "paul@cse.ucsc.edu" :description "Server created with WAIS release 8 b5 on Jul 9 03:51:11 1992 by paul@cse.ucsc.edu The files of type netnews used in the index were: /home/ftp/pub/comp.os.research" ) -------------------------------------------------------------------------------- Subject: [4.6] Miscellaneous resources From: General Internet-accessible resources - Paul Harrington maintains a World Wide Web page on checkpointing, at . - Jay Lepreau has made available an electronic version of the proceedings of OSDI '94 at . Available are such things as - Papers: abstracts, papers, slides, bibtex entries, and for most, the actual software. - Keynote: audio and slides - Extensible OS panel: audio, slides, project URLs - Insularity panel: audio - Mach/Chorus workshop: TRs for most, slides, some software - Tutorials: slides for half, descriptions for all - Miscellaneous: summary report from ;login, list of works-in-progress talks, hard-copy proceedings ordering info, CFP, proceedings introduction, list of referees. -------------------------------------------------------------------------------- Subject: [5] Disclaimer and copyright From: Disclaimer and copyright Note that this document is provided as is. The information in it is not warranted to be correct; you use it at your own risk. Following recent reports on the list I think it wise to change the copyright: NOTICE OF COPYRIGHT AND PERMISSIONS Answers to Frequently Asked Questions for comp.os.research (hereafter referred to as These Articles) are Copyright (C) 1993, 1994, 1995, and 1996 by Bryan O'Sullivan . They may be reproduced and distributed in whole or in part, subject to the following conditions: - This copyright and permission notice must be retained on all complete or partial copies of These Articles. - These Articles may be copied or distributed in part or in full for personal or educational use. Any translation, derivative work, or copies made for other purposes must be approved by the copyright holder before distribution, unless otherwise stated. - If you distribute These Articles, instructions for obtaining the complete current versions of them free or at cost price must be included. Redistributors must make reasonable efforts to maintain current copies of These Articles. Exceptions to these rules may be granted, and I shall be happy to answer any questions about this copyright notice -- write to Bryan O'Sullivan, PO Box 62215, Sunnyvale, CA 94088-2215, USA or email . 
These restrictions are here to protect the contributors, not to restrict you as educators and learners. ------------------------------------------------------------------------ Comp.os.research: Frequently answered questions [3/3: l/m 13 Aug 1996] -------------------------------------------------------------------------------- From: bos@serpentine.com (Bryan O'Sullivan) Newsgroups: comp.os.research Subject: Comp.os.research: Frequently answered questions [3/3: l/m 13 Aug 1996] Date: 1 Nov 1997 10:00:34 GMT Message-ID: <63euk2$iv@darkstar.ucsc.edu> Reply-To: os-faq@cse.ucsc.edu Summary: frequent topics of discussion on the operating systems research group Archive-name: os-research/part3 Version: $Revision: 1.3 $ Posting-Frequency: monthly Last-Modified: Tue Aug 13 21:03:20 1996 URL: http://www.serpentine.com/~bos/os-faq/ Answers to frequently asked questions for comp.os.research: part 3 of 3 Copyright (C) 1994--1996 Bryan O'Sullivan TABLE OF CONTENTS 1. Distributed systems 1.1. What is the current status of the (insert name) project? 1.2. How do approaches to load balancing differ? 1.3. Fault tolerance in distributed systems 1.4. Naming in distributed systems 1.5. Distributed shared memory 1.5.1. Data consistency 1.5.1.1. Strictly consistent systems 1.5.1.2. Relaxing consistency 1.5.1.3. Application-specific coherence 1.5.2. Access synchronisation 1.5.3. Transfer and caching granularity 1.5.4. Address space structure 1.5.5. Fault tolerance 1.5.6. A brief bibliography on distributed shared memory 1.6. What have we learned? 2. Needful things -------------------------------------------------------------------------------- Subject: [1] Distributed systems From: Distributed systems A great deal of the high-profile research carried out in operating systems these days deals with distributed computing. Not surprisingly, discussions of distributed systems make up a large amount of the traffic on comp.os.research. -------------------------------------------------------------------------------- Subject: [1.1] What is the current status of the (insert name) project? From: Distributed systems See the section on `available software' for information on distributions of some of the systems mentioned here. - The Amoeba project is still going. There are roughly 20 people working on it, but most of these are no longer kernel hackers. They are working on using it for parallel programming, wide-area distributed systems, and other things. Amoeba is used in over 100 universities at the moment, and is also used at commercial institutions. - Brazil is the new research operating system being developed at AT&T Bell Labs. Research topics being addressed in Brazil center on higher-performance machines and, particularly, networks. A new in-house 300 megabit/s switched fiber network increases the potential bandwidth between machines by at least an order of magnitude; our aim is to realize and exploit that bandwidth. The overall design is to eliminate unnecessary overhead, particularly by restructuring and redesigning where necessary to avoid copying data from element to element along the communications path. Most of this software (except the operating system kernel) is written in a new concurrent systems programming language, Alef, which makes it easy to write multi-process servers and applications that can communicate using messages or shared memory, as appropriate. A paper on Alef is available from the Plan 9 ftp site; see part 2 of this FAQ for a pointer. - Cronus is still under development at BBN. 
  The current public release is 3.0. The project currently has two thrusts---as the base for advanced distributed system R&D, and as a platform for constructing and deploying sophisticated distributed applications. Ongoing research topics include the integration of Cronus and Mach technology, the exploration of techniques for the construction of WAN-based and multi-organisational applications, investigation into the integration of distributed systems and network management systems, and work in high-performance distributed computing.
- Horus is being developed by the same group that worked on Isis; the head of this group is Robbert van Renesse.
- Isis is no longer being developed at Cornell; it is now managed as a commercial product.
- Mach is no longer being developed at CMU. Current work on Mach is being carried out by the OSF Research Institute and at the University of Utah.
- Plan 9 is no longer in development at AT&T Bell Labs. The operating systems research group at Bell Labs has moved on to a new project, called Brazil, which addresses portable computing and distributed applications programming.
- QNX is a commercial POSIX-certified realtime OS with an installed base of over 250,000 systems. It is used extensively in process control, factory automation, medical instrumentation, communications, and point-of-sale systems. A number of universities are also doing research with QNX.
- The Sprite network operating system project has ended.
--------------------------------------------------------------------------------
Subject: [1.2] How do approaches to load balancing differ?
From: Distributed systems

Load-balancing policy falls into two broad groups: static and dynamic. Static policies use algorithms which operate without regard to run-time loads across a system, while dynamic policies use the run-time performance of various parts of a system in order to make more `informed' decisions about balancing.

[92-11-06-12-53.57] A dynamic load-balancing policy is one which uses run-time state information in making scheduling decisions. There are two kinds of dynamic policies: adaptive and non-adaptive. The latter always use the same (fixed, load-dependent) policy; the former may adjust policy parameters in order to gradually improve their performance. The key point is that while non-adaptive policies use only information about the run-time state, adaptive policies use, in addition to this, information about current performance.

In adaptive policies, the rules for adjusting policy parameters may be static or dynamic. An example of the former might be: `shift to a conservative migration rule when system-wide load patterns are varying too rapidly'. An example of the latter could be: `increase the sender-side threshold when migrated jobs cause slowdown rather than speedup'. Some researchers refer to the performance-driven adaptation exhibited by the second policy as `learning'.

Since both non-adaptive policies and adaptive policies with static rules really use only load information, it is confusing to distinguish between them. One way to avoid such confusion is to restrict the use of the word `adaptive' to policies that use performance feedback in order to drive their adjustment of policy parameters.
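To make the distinction concrete, here is a minimal C sketch of a sender-initiated policy. It is not taken from any particular system; local_load() and pick_lightly_loaded_node() are assumed hooks into whatever load-measurement and node-selection machinery the underlying system provides. The point is only that the non-adaptive part consults run-time load, while the adaptive part adjusts a policy parameter using performance feedback from earlier migrations.

  /* Hypothetical hooks into the underlying system. */
  extern double local_load(void);               /* runnable jobs per CPU here */
  extern int pick_lightly_loaded_node(void);    /* candidate target, or -1    */

  static double threshold = 2.0;                /* sender-side migration threshold */

  /* Non-adaptive part: decide using run-time state (load) only. */
  int should_migrate(void)
  {
      return local_load() > threshold && pick_lightly_loaded_node() >= 0;
  }

  /* Adaptive part: use performance feedback (mean speedup of recently
   * migrated jobs) to adjust the policy parameter itself. */
  void adapt_threshold(double mean_speedup)
  {
      if (mean_speedup < 1.0)           /* migration made things worse */
          threshold *= 1.25;            /* become more conservative    */
      else if (mean_speedup > 1.2)      /* migration is clearly paying off */
          threshold *= 0.9;             /* migrate a little more eagerly   */
  }

A policy consisting only of should_migrate() would be dynamic but non-adaptive; adding adapt_threshold(), driven by measured speedups, is what makes it adaptive in the sense used above.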
-------------------------------------------------------------------------------- Subject: [1.3] Fault tolerance in distributed systems From: Distributed systems One approach to providing fault tolerance in distributed systems involves the use of redundant services, such that standby facilities can become active in the event of the failure of, or loss of connection to, a primary service. Another approach is to provide multiple paths of connectivity between the computers that make up the distributed system. The QNX system, for example, supports multiple network drivers per node. The purpose of the network connection under QNX is to merge the microkernels on the LAN into a single logical kernel. Hence, if multiple LAN connections per node are present, the networking code can load balance the LAN traffic on the paths available. It can also route around failed links, providing both greater LAN bandwidth and better fault tolerance. See below for treatment of fault tolerance in systems which make use of distributed shared memory. -------------------------------------------------------------------------------- Subject: [1.4] Naming in distributed systems From: Distributed systems [Material on naming and/or global naming sought.] -------------------------------------------------------------------------------- Subject: [1.5] Distributed shared memory From: Distributed systems Distributed computer systems have evolved using message passing as their main method of communication. Other communication systems used in loosely coupled distributed systems, such as RPC, are usually implemented on top of an underlying message passing system. On the other hand, in tightly coupled systems, such as a multi-processor machine, the communication method used is usually shared memory. In distributed shared memory (DSM) systems [Nitzberg & Lo, 91], processes share data transparently across node boundaries; data faulting, location, and movement is handled by the underlying system. Among other things, this allows parallel programs designed to use shared memory to execute transparently on a loosely coupled distributed system. While the performance implications cannot be ignored, the advantages of the shared memory programming model are well known: - Shared memory programs are usually shorter and easier to understand than equivalent message passing programs. - Large or complex data structures may easily be communicated. - Shared memory gives transparent process-to-process communication. - Programming with shared memory is a well-understood problem. Shared-memory (or `procedure-oriented') and message-oriented operating systems are, in some sense, equivalent [Lauer & Needham, 78], though it has been claimed that the former are `more powerful' [Tam et al., 90]. -------------------------------------------------------------------------------- Subject: [1.5.1] Data consistency From: Distributed systems Despite recent advances in both local and wide-area networking technologies, network latency is still a major factor in distributed systems and likely to remain so. All DSM systems provide some sort of caching in an attempt to improve the performance beyond that provided by doing a network access on every reference to a non-local data item. Each system must decide whether or not to attempt to keep the data coherent, and, if so, what coherence strategy to use. 
The coherence semantics which may be provided to the programmer include:
- `strict' consistency, where a read always returns the value written by the most recent write;
- `loose' consistency, where the system enforces some form of weak consistency guarantee and the application (or compiler or user) can indicate synchronisation points at which consistency must be enforced;
- no automatic consistency mechanism, with the user instead being given the facilities necessary to implement user-level synchronisation and consistency.
--------------------------------------------------------------------------------
Subject: [1.5.1.1] Strictly consistent systems
From: Distributed systems

Older, strictly consistent systems tend to enforce a single-writer, multiple-reader model, where at any time data will be held either at a single node (which may have write access) or at several nodes (none of which may have write access).

Given this model, we must be able to locate a copy of our data when it is not resident. The method most frequently used is to assign an `owner' to each item of data, where the owner has either the only writeable copy of the data, or one of the read-only copies. Ownership may remain fixed throughout the life of a datum, or it may change dynamically. In the latter case, the problem arises of locating the owner. A database of locations may be maintained by centralised managers, or ownership information can be distributed among the nodes of the system [Li and Hudak, 89].

In a strictly consistent system, we must also be able to synchronise writes. The two major solutions to this problem are:
- Write broadcast. The effects of every write are broadcast to every node that has a copy of the data being written; this effectively implements a replication algorithm. Write broadcast is usually considered too expensive to be used as a general solution.
- Write invalidation. Each node in the system holding a read-only copy of the data being written is sent an invalidation message.
--------------------------------------------------------------------------------
Subject: [1.5.1.2] Relaxing consistency
From: Distributed systems

Permitting temporary inconsistencies is a common method of increasing performance in distributed systems. Memory is said to be loosely coherent if the value returned by a read operation is the value written by an update operation to the same object that `could' have immediately preceded the read operation in some legal schedule of the threads in execution [Bennett et al., 90]. Using loose coherence, more than one thread may have write access to the same object, provided that the programmer knows that the writes will not conflict.

Another memory consistency model is `release consistency' [Gharachorloo et al., 90], in which memory accesses are divided into ordinary and synchronisation-related accesses. The latter are further divided into `acquire' and `release' operations. The `acquire' operation indicates that shared data is needed, and a processor's updates are not guaranteed to be performed at other nodes until a `release' is performed. The primary advantage of this form of consistency is that it allows consistency updates to be tied to synchronisation events, and therefore to be delayed until actually needed by applications. However, most release-consistent systems require the programmer to make explicit use of `acquire' and `release' operations.

A DSM system called Midway introduces another new consistency model, `entry consistency' [Bershad et al., 93].
Entry consistency is weaker than many of the other models suggested, including release consistency; it requires explicit annotations to associate synchronisation objects with data. On an `acquire', only the data associated with the synchronisation object is guaranteed to be consistent. This extra weakness permits higher-performance implementations of the underlying consistency protocols to be written. Midway also supports stronger consistency models, so that the application programmer can trade off performance against the extra effort required to write entry-consistent programs.
--------------------------------------------------------------------------------
Subject: [1.5.1.3] Application-specific coherence
From: Distributed systems

From [Cheriton, 86]: `Problem-oriented shared memory' is a shared memory that implements fetch and store operations specialised to the particular problem or application it is supporting. In particular, a problem-oriented shared memory commonly provides a specialised form of consistency and consistency maintenance that exploits application-specific semantics.

Cheriton goes on to propose that consistency constraints be relaxed, and more use be made of problem semantics. He suggests that, in some cases, stale data may be detected on use by the client, and the client may then recover; an example would be hint caching. In some applications, stale data may actually be sufficiently accurate, provided that the client can obtain up-to-date information when necessary. In other applications, some data may be optional, in the sense that the client can continue without it. Still other applications may tolerate having the results of store operations lost or undone; an example is an application that regularly updates the entire data set.

Another approach is presented by the designers of Munin, where the runtime system accepts hints from the compiler or user to determine the coherence mechanism to be used for each object. The default, in the absence of hints, is to use a general read-write consistency mechanism, much like that employed by IVY. Munin supports several different object types that are based on the results of a survey of shared memory access characteristics. The results of the survey showed that a very small percentage of all accesses to shared data fall under the general read-write type. The Munin designers also note that a program moves through various stages of execution, and that the types associated with objects change as time progresses.
--------------------------------------------------------------------------------
Subject: [1.5.2] Access synchronisation
From: Distributed systems

Most parallel applications use some sort of synchronisation system to order and control accesses to shared data before actually accessing the data. The most important thing to note in DSM systems is that blindly using standard test-and-set operations on bytes in shared pages will produce a high fault rate; faults are usually expensive, making this approach unacceptable.

Clouds merges locking with the cache consistency protocol, so that the user may obtain both a lock and the data in one network transaction. This system has the advantage that no invalidation messages are required, since the granting of the lock guarantees that there are no conflicting copies; it has the disadvantage that an explicit unlock/discard operation is required to release access to the data.
This is acceptable in Clouds, as the DSM system was designed specifically to support object invocation, so it is easy to discard on a return. Munin provides a distributed lock mechanism using `proxy objects' to reduce network load. Proxy objects are maintained by a lock server on each node; when a thread wants to obtain a lock on an object, it attempts to lock the proxy instead. The server obtains the global lock if it is not already held locally. Global locking is done by negotiating with all the other lock servers in the system. Each lock may be migrated from server to server, and part of the Munin system allows objects to be migrated along with their locks. Other systems, such as IVY and Mermaid, use modified versions of classic multiprocessor synchronisation facilities. -------------------------------------------------------------------------------- Subject: [1.5.3] Transfer and caching granularity From: Distributed systems When caching objects in local memory, it is necessary to decide what level of granularity to use. All current systems use a fixed block size in the cache, rather than varying the granularity based on object size. Usually this is due to constraints imposed by the system hardware and memory management. The choice of the block size in the cache depends on several issues. - Cost of communication: for example, on many local area networks there is little difference between the time required to send a one-byte message and that required to send a 1024-byte message. Transmitting bulk changes rather than single-byte modifications would therefore seem desirable. - The choice of granularity also depends on the locality of reference in the application, as thrashing may occur when two machines are both accessing the same block (this is also known as the `ping-pong effect'). This would seem to argue for a smaller block size. It should be noted that many object-oriented systems exhibit very poor locality of reference. In practice, a compromise must be achieved, as with conventional virtual memory systems. Most systems use a block size which is the same as that of the virtual memory management unit on the system, or a multiple thereof. Among other things, it allows the hardware to be used to help in the maintenance of consistency. The choice is complicated somewhat when heterogeneous machines are being used, but in these cases, the lowest common multiple of hardware supported page sizes can usually be used. The only major system that doesn't use a large block size is Memnet, in which a hardware based DSM system was implemented on a high speed token ring; a 32-byte block size was used instead [Delp & Farber]. The choice of a small block size is appropriate, as the system is much closer to a shared memory multi-processor than it is to a software DSM system. This is because the entire processor is blocked on a cache miss; the processor is not actually aware of the distributed nature of its address space. Also, the ratio between remote and local memory access times is much lower than in the software based systems due to the dedicated token ring (200Mbps) and hardware assistance. -------------------------------------------------------------------------------- Subject: [1.5.4] Address space structure From: Distributed systems In a single shared address space system, the system appears as a set of threads executing in a shared distributed address space. Objects always appear at the same addresses on all nodes. 
Single address space systems have had a resurgence in popularity with the arrival of 64-bit processors. A number of researchers believe that a 64-bit address space is large enough to act as a single global address space for all the memory (both primary and secondary) in a distributed system. Examples of such systems include Angel, Mungi, and Opal. Security and protection are a major problem in such systems, and current approaches either rely on hardware assistance or stochastic algorithms, or ignore the problem.

Another approach is to divide each process's address space into different fixed regions, some of which are private and not shared, and some of which are shared with some other processes. Ra, the Clouds kernel, takes this approach, using O, P, and K address regions, with the O region shared between all processes executing in a given object; the P and K regions are local to a process and the kernel, respectively. Here, objects always appear at the same address, but may not be visible from every address space.

By contrast, some systems, including Mirage and Mach, allow shared data to exist at differing addresses in different processes' address spaces. However, neither system does transparent pointer translation, so the address changes are not entirely transparent to the application.

As for the structuring of the shared region itself, some systems -- for example, IVY and Mether -- use a single flat region: one continuous range of virtual addresses represents the shared address space and is managed by the DSM system. This single address space is usually sub-divided into pages. Most systems use paged segmentation: the shared region consists of disjoint pieces, which are usually managed separately and are not all mapped in any one process. Frequently, the segments (sometimes called memory objects, or windows) are related to the backing store. For example, in Clouds, the object address space consists of windows onto larger segments; these segments are usually maintained on secondary storage.
--------------------------------------------------------------------------------
Subject: [1.5.5] Fault tolerance
From: Distributed systems

Most DSM systems ignore the fault tolerance issue, or maintain that it is an operating system issue and should be handled by the underlying system. However, it would appear that in practice a DSM system would strongly affect the fault tolerance of a system. For example, where several nodes are sharing access to a set of data, the failure of any one of them could lead to the failure of all the connected sites (or, at least, of some of the processes on each site).

We are also presented with an unusual failure handling problem. It is fairly easy to see how to handle a failed message or RPC, but how do you handle a failed page fault?

The original Clouds system provided recoverability using shadowing of segments and a transactional system using commits. The recovery system was not really integrated with the DSM system, and was merely implemented at the segment storage site. In order to maintain a consistent view of data when one transaction is active at multiple nodes, the Clouds designers have more recently been forced to integrate the transaction system with the DSM support system.
--------------------------------------------------------------------------------
Subject: [1.5.6] A brief bibliography on distributed shared memory
From: Distributed systems

[Nitzberg & Lo, 91] Nitzberg, W.
and Lo, V., `Distributed shared memory: a survey of issues and algorithms', IEEE Computer, August 91, pp. 52-60

[Lauer & Needham, 78] Lauer, H. C. & Needham, R. M., `On the duality of operating system structures', Proceedings of the 2nd International Symposium on Operating Systems, IRIA, October 78; reprinted in ACM Operating Systems Review 13(2), April 79, pp. 3-19

[Tam et al., 90] Tam, M.-C., Smith, J. M. & Farber, D. J., `A taxonomy-based comparison of several distributed shared memory systems', ACM Operating Systems Review 24(3), July 90, pp. 40-67

[Li and Hudak, 89] Li, K. & Hudak, P., `Memory coherence in shared virtual memory systems', ACM Transactions on Computer Systems 7(4), November 89, pp. 321-359

[Bennett et al., 90] Bennett, J. K., Carter, J. B. & Zwaenepoel, W., `Munin: distributed shared memory based on type-specific memory coherence', Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, SIGPLAN Notices 25(3), March 90, pp. 168-176

[Gharachorloo et al., 90] Gharachorloo, K., et al., `Memory consistency and event ordering in scalable shared-memory multiprocessors', ACM SIGARCH News 18(2), June 90

[Bershad et al., 93] Bershad, B. N., et al., `The Midway distributed shared memory system', Technical Report CMU-CS-93-119, School of Computer Science, Carnegie Mellon University, 1993. Available via anonymous ftp from .

[Cheriton, 86] Cheriton, D. R., `Problem-oriented shared memory: a decentralized approach to distributed system design', Proceedings of the 6th International Conference on Distributed Computing Systems, May 86, pp. 190-197

[Delp & Farber] Delp, G. S. & Farber, D. J., `Memnet -- a different approach to a network', Technical Report, Department of Electrical Engineering, University of Delaware, ???
--------------------------------------------------------------------------------
Subject: [1.6] What have we learned?
From: Distributed systems

Andy Tanenbaum started a (very long) thread on this topic in comp.os.research in April of 1992 [92-04-03-17-10.05]. The interested reader is directed to the comp.os.research archives, since this thread proved rather divisive (i.e. nobody really agreed on any issue).
--------------------------------------------------------------------------------
Subject: [2] Needful things
From: Needful things

This FAQ is incomplete, and will probably remain in this state to a greater or lesser extent for ever and ever. Should you feel willing to contribute some material, the following is a list of topics which ``urgently'' require treatment (some of which I may get around to covering myself at some point):

- naming in distributed systems