ftp.cse.unsw.edu.au

home *** CD-ROM | disk | FTP | other *** search

/ ftp.cse.unsw.edu.au / 2014.06.ftp.cse.unsw.edu.au.tar / ftp.cse.unsw.edu.au / pub / doc / papers / misc / cs.toronto.edu:programming / c-news.doc < prev next >

Wrap

Text File | 1992-10-18 | 47.0 KB | 1,072 lines

NNNNeeeewwwwssss NNNNeeeeeeeedddd NNNNooootttt BBBBeeee SSSSlllloooowwww _G_e_o_f_f _C_o_l_l_y_e_r _D_e_p_a_r_t_m_e_n_t _o_f _S_t_a_t_i_s_t_i_c_s* _U_n_i_v_e_r_s_i_t_y _o_f _T_o_r_o_n_t_o _u_t_z_o_o!_u_t_c_s_r_i!_u_t_f_r_a_s_e_r!_g_e_o_f_f _H_e_n_r_y _S_p_e_n_c_e_r _Z_o_o_l_o_g_y _C_o_m_p_u_t_e_r _S_y_s_t_e_m_s _U_n_i_v_e_r_s_i_t_y _o_f _T_o_r_o_n_t_o _u_t_z_o_o!_h_e_n_r_y _A_B_S_T_R_A_C_T C news is a re-write, from scratch, of the `transport layer' of the Usenet software. C rnews runs at over 19 times the speed of B rnews; C expire runs in minutes rather than the hours taken by B expire. These performance improvements were (mostly) quite simple and straightforward, and they exemplify general principles of performance tuning. _1. _H_i_s_t_o_r_y _a_n_d _M_o_t_i_v_a_t_i_o_n In the beginning (of Usenet) (1979) was A news, written at Duke University by Steve Bellovin, Stephen Daniel, Tom Truscott and others. A single program, _n_e_w_s, received, relayed, perused and cleaned out news articles. All arti- cles were stored in a single UNIX|- directory, which made A news suitable for local news and low volumes of network news. News articles were exchanged using a simple message format in which the first five lines of a message contained the information nowadays found in the article headers: unique article-id, newsgroup(s), author, subject, date posted. As Usenet began to grow (1981), people in and around the University of California at Berkeley, including Matt Glickman and Mark Horton, modified A news extensively. The _________________________ * Work done mostly while at U of T Computing Services. |- UNIX is a registered trademark of AT&T. April 18, 1990 - 2 - articles of each newsgroup were now stored in a separate directory. The message format was changed from the rigid and inextensible A news header format to one conforming to ARPA RFC 822 (the current ARPA mail-message format stan- dard). _N_e_w_s was broken into separate programs: _r_e_a_d_n_e_w_s, _i_n_e_w_s (aka _r_n_e_w_s), and _e_x_p_i_r_e. The authors dubbed the result ``B news''. Since the release of B news, it has replaced A news almost|= everywhere on Usenet. It soon became clear that sending individual articles from machine to machine as separate _u_u_c_p transactions was unacceptably slow, in part because it produced large _u_u_c_p spool directories, which are searched quite slowly* by the kernel. Sites began to _b_a_t_c_h articles into batches of (typ- ically) 50,000-100,000 bytes for transmission to other machines. At about this time, B news was changed to file news articles in a tree, as (for example) /usr/spool/news/net/women/only/123, rather than as /usr/spool/news/net.women.only/123. The motive for this was primarily elimination of problems with long newsgroup names, but shortening directories (and thus speeding searches) was also a consideration. As Usenet traffic continued to grow explosively, sites began to use data compression on news batches. The main objective was to reduce expensive long-distance phone time, but again performance improved a bit: the extra processor time used for compression and decompression was more than gained back by the reduction in processor time used by _u_u_c_p itself. Unfortunately, B news has been modified by many people since 1981, and has mutated repeatedly to match the changing nature of Usenet. It has become complex, slow, and diffi- cult to maintain. During 1985, we observed that the nightly arrival of new news and expiry of old news were consuming resources far out of proportion to the volume of data involved|-. _E_x_p_i_r_e often ran for 90 minutes, and _r_n_e_w_s processing averaged 10 seconds or more per article. Both programs tended to crip- ple systems by performing much disk i/o and eating much system-mode CPU time. _U_t_c_s was running B 2.10.1 news then _________________________ |= AT&T Bell Laboratories Research still runs A news for local newsgroups. * Recent _u_u_c_ps (notably Honey DanBer) provide spool sub-directories, and recent 4BSD (4.3BSD and later) kernels provide linear (as opposed to quadratic) directory searching, both of which help this problem. |- Never mind the cost/benefit ratio. April 18, 1990 - 3 - and _u_t_z_o_o was running B 2.10 news. Although newer B news releases were available, they offered little beyond B 2.10, and it was often necessary to regression-test new B news releases to verify that reported, published bug fixes had in fact been applied. Spencer acted first and rewrote _e_x_p_i_r_e from the ground up. Though it initially lacked any form of selective expiry, this _e_x_p_i_r_e, when run each night, finished in about 15 minutes. (This was on 750-class machines holding all Usenet news and expiring after 14 days.) Collyer observed in November 1985 that B _r_n_e_w_s, upon receiving a batch of news, immediately _e_x_e_ced a trivial unbatcher which copied each article into a temporary file and then forked and _e_x_e_ced B rnews again. Such a technique is clearly overkill for articles averaging about 3,000 bytes each. Preliminary experiments failed to produce a modified B rnews that could unravel a batch without forking. Consul- tation with Rick Adams, the current B-news maintainer, revealed that this same technique remained in the upcoming B news release (variously B 2.10.3 or B 2.11). Within one week|=, a from-scratch C _r_n_e_w_s prototype was working well enough to run experimentally on a `leaf' machine receiving a subset of news. This prototype version lacked a good many necessary amenities, and over the next eight months it was enhanced to bring it up to full functionality. It was also tuned heavily to improve its performance, since it was faster than B _r_n_e_w_s but still not fast enough to make us happy. Once the _r_n_e_w_s newsgroup name matching routines were working, Spencer revised _e_x_p_i_r_e to add selective expiry, specified in a control file. Recently, we have also revised our old batcher heavily, largely to add capability but with an eye on performance. _2. _R_n_e_w_s _P_e_r_f_o_r_m_a_n_c_e The basic objective of C news was simpler code and higher performance. This may sound trite, but note that performance was an explicit objective. That was important. _P_r_o_g_r_a_m_s _w_i_l_l _s_e_l_d_o_m _r_u_n _f_a_s_t_e_r _u_n_l_e_s_s _y_o_u _c_a_r_e _a_b_o_u_t _m_a_k_i_n_g _t_h_e_m _r_u_n _f_a_s_t_e_r. `Faster' implies comparison to a slower version. Know- ing the value of improvements, and assessing this in rela- tion to their cost, requires knowing the performance of the unimproved version. Collyer kept detailed records of his work on _r_n_e_w_s, so he could see how much progress he was _________________________ |= 40 hours, Collyer didn't have to work hard. April 18, 1990 - 4 - making. See the Appendix for the final result. _T_o _k_n_o_w _h_o_w _t_o _g_e_t _s_o_m_e_w_h_e_r_e, _y_o_u _m_u_s_t _k_n_o_w _w_h_e_r_e _y_o_u _a_r_e _s_t_a_r_t_i_n_g _f_r_o_m. The first functional C _r_n_e_w_s ran at about 3 times the speed of B _r_n_e_w_s. We had assumed that merely eliminating the fork/exec on each article would give a factor of 10 improvement, so this was disappointing. _A_v_o_i_d_i_n_g _o_b_v_i_o_u_s _p_e_r_f_o_r_m_a_n_c_e _d_i_s_a_s_t_e_r_s _h_e_l_p_s... _b_u_t _i_t'_s _n_o_t _a_l_w_a_y_s _e_n_o_u_g_h. Profiling, first with _p_r_o_f(1) and later with 4.2BSD's _g_p_r_o_f(1), and rewriting of the bottlenecks thus discovered, eventually brought the speed up to over 19 times the speed of B _r_n_e_w_s. This required a number of write-profile-study- rewrite cycles. There is undoubtedly still a lot of code which could be faster than it is, but since profiling shows that it doesn't have a significant impact on overall perfor- mance, who cares? _T_o _l_o_c_a_t_e _p_e_r_f_o_r_m_a_n_c_e _p_r_o_b_l_e_m_s, _l_o_o_k _t_h_r_o_u_g_h _t_h_e _e_y_e_s _o_f _t_h_y _p_r_o_f_i_l_e_r. Collyer first experimented with using _r_e_a_d and _w_r_i_t_e system calls instead of _f_r_e_a_d and _f_w_r_i_t_e, and got a substan- tial saving. Though the usage of system calls in this experiment was unportable, the saving eventually lead him to rewrite _f_r_e_a_d and _f_w_r_i_t_e from scratch to reduce the per-byte overheads. This helped noticeably, since pre-System-V _f_r_e_a_d and _f_w_r_i_t_e are really quite inefficient. _I_f _t_h_y _l_i_b_r_a_r_y _f_u_n_c_t_i_o_n _o_f_f_e_n_d_s _t_h_e_e, _p_l_u_c_k _i_t _o_u_t _a_n_d _f_i_x _i_t. At the time, C _r_n_e_w_s was doing fairly fine-grain lock- ing, essentially locking each file independently on each use. News doesn't need the resulting potential concurrency, especially when _r_n_e_w_s runs relatively quickly, and the lock- ing was clearly a substantial fraction of the execution time. C _r_n_e_w_s was changed to use B-news compatible locking, with a single lock for the news system as a whole. _S_i_m_p_l_i_- _c_i_t_y _a_n_d _s_p_e_e_d _o_f_t_e_n _g_o _t_o_g_e_t_h_e_r. When sending articles to a site using batching, _r_n_e_w_s just appends the filename of each article to a _b_a_t_c_h _f_i_l_e for that site. The batch file is later processed by the batcher. In principle, batching is an option, and different sites may get different sets of newsgroups. In practice, few articles are ever sent unbatched, and most articles go to all sites fed by a given system. This means that _r_n_e_w_s is repeatedly appending lines to the same set of batch files. Noticing this, Collyer changed C _r_n_e_w_s to keep these files open, rather than re-opening them for every article*. _O_n_c_e _y_o_u'_v_e _g_o_t _i_t, _h_a_n_g _o_n_t_o _i_t. _________________________ * The price for this tactic is that the code has to be prepared for the possibility that the number of sites being fed exceeds the supply of file descriptors. Fortunately, that is rare. April 18, 1990 - 5 - These two simple changes-coarser locking and retaining open files-cut system time by about 20% and real time by still more. On return from Christmas holidays, after considerable agonizing over performance issues, Collyer turned some small, heavily-used character-handling functions into mac- ros. This reduced user-mode time quite a bit. _A _f_u_n_c_t_i_o_n _c_a_l_l _i_s _a_n _e_x_p_e_n_s_i_v_e _w_a_y _t_o _p_e_r_f_o_r_m _a _s_m_a_l_l, _q_u_i_c_k _t_a_s_k. _R_n_e_w_s was always looking up files by full pathnames. Changing it to _c_h_d_i_r to the right place and use relative names thereafter reduced system time substantially. _A_b_s_o_- _l_u_t_e _p_a_t_h_n_a_m_e_s _a_r_e _c_o_n_v_e_n_i_e_n_t _b_u_t _n_o_t _c_h_e_a_p. Studying the profiling data revealed that _r_n_e_w_s was spending a lot of time re-re-re-reading the _s_y_s and _a_c_t_i_v_e files. These files are needed for processing every article, and they are not large. Collyer modified _r_n_e_w_s to simply read these files in once and keep them in core. This change alone cut system time and real time by roughly 30%. _A_g_a_i_n, _o_n_c_e _y_o_u'_v_e _g_o_t _i_t, _d_o_n'_t _t_h_r_o_w _i_t _a_w_a_y! There is a more subtle point here, as well. When these files were re-read every time, they were generally processed a line at a time. The revised strategy was to _s_t_a_t the file to determine its size, _m_a_l_l_o_c enough space for the whole file, and bring it in with a single _r_e_a_d. This is a vastly more efficient way to read a file! _T_a_s_k_s _w_h_i_c_h _c_a_n _b_e _d_o_n_e _i_n _o_n_e _o_p_e_r_a_t_i_o_n _s_h_o_u_l_d _b_e. At this point (mid-January 1986), C _r_n_e_w_s was faster than B _r_n_e_w_s by one order of magnitude, and there was much rejoicing. In principle, the `Newsgroups:' header line, determin- ing what directories the article will be filed in, can be arbitrarily far from the start of the article. In practice, it is almost always found within the first thousand bytes or so. By complicating rnews substantially, it became possible in most cases to _c_r_e_a_t the file in the right place (or the first of the right places) in /_u_s_r/_s_p_o_o_l/_n_e_w_s before writing any of the article to disk, eliminating the need for tem- porary files or even temporary links. The improvement in system time was noticeable, and the improvement in user time was even more noticeable. _P_r_e_p_a_r_e _f_o_r _t_h_e _w_o_r_s_t _c_a_s_e, _b_u_t _o_p_t_i_m_i_z_e _f_o_r _t_h_e _t_y_p_i_c_a_l _c_a_s_e. There are certain circumstances, notably control- message articles, in which it is necessary to re-read the article after filing it. _R_n_e_w_s originally re-opened the article to permit this. Changing the invocation of _f_o_p_e_n to use the wwww++++ mode made it possible to just seek back to the beginning instead, which is _m_u_c_h faster. This, plus some April 18, 1990 - 6 - similar elimination of other redundant calls to _o_p_e_n, reduced system time by over 30%. _G_e_t _a_s _m_u_c_h _m_i_l_e_a_g_e _a_s _p_o_s_s_i_b_l_e _o_u_t _o_f _e_a_c_h _c_a_l_l _t_o _t_h_e _k_e_r_n_e_l. Both scanning the in-core _a_c_t_i_v_e and _s_y_s files and re- writing the _a_c_t_i_v_e file are simpler if the in-core copies are kept exactly as on disk, but this implied frequent scans to locate the ends of lines. It turned out to be worthwhile to pre-scan the _a_c_t_i_v_e file for line boundaries, and remember them. _W_h_e_n _s_t_o_r_i_n_g _f_i_l_e_s _i_n _a_n _u_n_s_t_r_u_c_t_u_r_e_d _w_a_y, _a _l_i_t_t_l_e _r_e_m_e_m_b_e_r_e_d _i_n_f_o_r_m_a_t_i_o_n _a_b_o_u_t _t_h_e_i_r _s_t_r_u_c_t_u_r_e _g_o_e_s _a _l_o_n_g _w_a_y _i_n _s_p_e_e_d_i_n_g _u_p _a_c_c_e_s_s. We already had a _S_T_R_E_Q macro, just a simple invocation of _s_t_r_c_m_p, as a convenience. As a result of some other experience by Spencer, Collyer tried replacing some calls of _s_t_r_n_c_m_p by a _S_T_R_E_Q_N macro, which compared the first charac- ter of the two strings in-line before incurring the overhead of calling _s_t_r_n_c_m_p. This sped things up noticeably, and later got propagated through more and more of the code. String-equality tests usually fail on the very first charac- ter. _T_e_s_t _t_h_e _w_a_t_e_r _b_e_f_o_r_e _t_a_k_i_n_g _t_h_e _p_l_u_n_g_e. While looking at string functions, Collyer noticed that _s_t_r_n_c_m_ps to determine whether a line was a particular header line had the comparison length computed by applying _s_t_r_l_e_n to the prototype header. With a little bit of work, the prototypes were isolated as individual character arrays ini- tialized at compile time. This permitted substituting the compile-time _s_i_z_e_o_f operation for the run-time _s_t_r_l_e_n. _L_e_t _t_h_e _c_o_m_p_i_l_e_r _d_o _t_h_e _w_o_r_k _w_h_e_n _p_o_s_s_i_b_l_e. At this point, profiling was turned off temporarily for speed tests. Profiling does impose some overhead. The speed trials showed that C _r_n_e_w_s was now running at over 15 times the speed of B _r_n_e_w_s. After months of adding frills, bunting and B 2.11 com- patibility*, Collyer again returned to performance tuning in August 1986. The 4.2BSD kernel on _u_t_c_s now included the 4.3BSD _n_a_m_e_i caches, which improve filename-lookup perfor- mance considerably. Unfortunately, considerations of crash recovery dictated some loss in performance: it seemed desir- able to put batch-file additions out by the line rather than by the block. _P_e_r_f_o_r_m_a_n_c_e _i_s _n_o_t _e_v_e_r_y_t_h_i_n_g. _G_p_r_o_f revealed that newsgroup name matching was an unexpected bottleneck, so that module was extensively tweaked by adding _r_e_g_i_s_t_e_r declarations, turning functions _________________________ * And supposed B 2.11 compatibility, as those who remember the short-lived cross-posting restrictions will recall. April 18, 1990 - 7 - into macros, applying _S_T_R_E_Q_N and such more widely, and gen- erally tuning the details of string operations. The code that handled _s_y_s-file lines got similar treatment next. The combination cut 40% off user-mode time. _P_e_r_s_i_s_t_e_n_t _t_u_n_i_n_g _o_f _k_e_y _m_o_d_u_l_e_s _c_a_n _y_i_e_l_d _l_a_r_g_e _b_e_n_e_f_i_t_s. Newsgroup matching remained moderately costly, and an investigation of where it was being used revealed two separate tests for a particular special form of name. It proved awkward to combine the two, so the testing routine was changed to remember having done that particular test already. _I_f _t_h_e _s_a_m_e _q_u_e_s_t_i_o_n _i_s _a_s_k_e_d _r_e_p_e_a_t_e_d_l_y, _m_e_m_o_r_i_z_e _t_h_e _a_n_s_w_e_r. By this time, the number of system calls needed to pro- cess a single article could be counted on one's fingers, and their individual contributions could be assessed. At one point it was desirable for a _c_r_e_a_t to fail if the file already existed, so this was being checked with a call to _a_c_c_e_s_s first. John Gilmore pointed out that on systems with a 3-argument _o_p_e_n (4.2BSD, System V), this test can be folded into the _o_p_e_n. The elimination of the extra name->file (_n_a_m_e_i) mapping cut both system time and real time by another 15%. (Note that this system _d_o_e_s have _n_a_m_e_i cacheing!) _F_i_l_e _n_a_m_e _l_o_o_k_u_p_s _a_r_e _e_x_p_e_n_s_i_v_e; _m_i_n_i_m_i_z_e _t_h_e_m. The development system (uuuuttttccccssss, a 750) is now filing 2-3 articles per second on average; _u_t_f_r_a_s_e_r (a Sun 3/160 with an Eagle disk) is typically filing 6-7 articles per second. C _r_n_e_w_s runs over 19 times as fast in real time as B _r_n_e_w_s, over 25 times as fast in system-mode CPU time, roughly 3.6 times as fast in user-mode CPU time, and over 10 times as fast in combined CPU times. With one exception (see _F_u_t_u_r_e _D_i_r_e_c_t_i_o_n_s), it now appears that very little can be done to speed up _r_n_e_w_s without changing the specifications. It seems to be execut- ing nearly the bare minimum of system calls, and the user- level code has been hand-optimised fairly heavily. _3. _E_x_p_i_r_e _P_e_r_f_o_r_m_a_n_c_e The rewrite of _e_x_p_i_r_e that started this whole effort was only partly motivated by performance problems. Perfor- mance was definitely bad enough to require attention, but the B _e_x_p_i_r_e of the time also had some serious bugs. Worse, the code was a terrible mess and was almost impossible to understand, never mind fix. Early efforts were directed mainly at producing a version that would _w_o_r_k; rewriting _e_x_p_i_r_e from scratch simply looked like the easiest route. Decisions made along the way, largely for other reasons, nevertheless produced major speedups. The first of these decisions was a reduction in the April 18, 1990 - 8 - scope of the program. B _e_x_p_i_r_e had several options for doing quite unrelated tasks, such as rebuilding news's his- tory file. The code for these functions was substantial and was somewhat interwoven with the rest. C _e_x_p_i_r_e adheres closely to a central tenet of the `Unix Philosophy': _a _p_r_o_- _g_r_a_m _s_h_o_u_l_d _d_o _o_n_e _t_a_s_k, _a_n_d _d_o _i_t _w_e_l_l. This may appear unrelated to performance, but better-focussed programs are generally simpler and smaller, reducing their resource con- sumption and making performance tuning easier (and hence more likely). In addition, a multipurpose program almost always pays some performance penalty for its generality. The second significant decision had the biggest effect on performance, despite being made for totally unrelated reasons. For each news article, the B news history file contained the arrival date and an indication of what news- groups it was in. This is _a_l_m_o_s_t all the information that _e_x_p_i_r_e needs to decide whether to expire an article or not. The missing* data is whether the article contains an expli- cit expiry date, and if so, what it is. B _e_x_p_i_r_e had to discover this for itself, which required opening the article and parsing its headers. A site which retains news for two weeks will have upwards of 5,000 articles on file. A few dozen of them will have explicit expiry dates. _B_u_t _B _e_x_p_i_r_e _o_p_e_n_e_d _a_n_d _s_c_a_n_n_e_d _a_l_l _5,_0_0_0+ _a_r_t_i_c_l_e_s _e_v_e_r_y _t_i_m_e _i_t _r_a_n! This was a performance disaster. We actually did not want to parse headers in _e_x_p_i_r_e at all, because the B news header-parsing code was (and is) complex and was known to contain major bugs. The perfor- mance implications of this were obvious, although secondary at the time. Header parsing is itself a non-trivial task, and accessing 5,000+ files simply cannot be made cheap. _I_n_f_o_r_m_a_t_i_o_n _n_e_e_d_e_d _c_e_n_t_r_a_l_l_y _s_h_o_u_l_d _b_e _k_e_p_t _c_e_n_t_r_a_l_l_y. The C news history file has the same format as that of B news, with one addition: a field recording the explicit expiry date, if any, of each article. If no expiry date is present in the article, the field contains `-' as a place- marker|-. In this way, the header parsing is done _o_n_c_e per _________________________ * Recent versions of B news have made some attempt to redress this lack, but haven't gone as far as C expire. The discussion here applies to the B expire that was current at the time C expire was written. |- It would be possible to simply compute a definitive expiry date for an article when it arrives, and record that. This would eliminate the decision-making overhead in _e_x_p_i_r_e, but would greatly slow the response to changes in expiry policy. Since one reason to change policy is time-critical problems like a shortage of disk space, this loss of flexibility was judged unacceptable. It is better to leave the expiry decision to _e_x_p_i_r_e and concentrate on making _e_x_p_i_r_e do April 18, 1990 - 9 - article, on arrival. In fact, the extra effort involved is essentially nil, since _r_n_e_w_s does full header parsing at arrival time anyway. _R_n_e_w_s had to be changed to write out the expiry date, and code which knew the format of the his- tory file had to be changed to know about the extra field. Perhaps a dozen lines of code outside _e_x_p_i_r_e were involved. A crude first version of C _e_x_p_i_r_e, incorporating these decisions in the most minimal way, ran an order of magnitude faster than B _e_x_p_i_r_e. Precise timing comparisons were not practical at the time, since the original motive for C _e_x_p_i_r_e was that B _e_x_p_i_r_e had stopped working completely, crippled by bugs in its header parsing. Later versions of B _e_x_p_i_r_e did cure this problem, but we were no longer interested in putting up slow, buggy software just to make an accurate comparison. Further work on C _e_x_p_i_r_e mostly concentrated on clean- ing up the hasty first version, and on incorporating desired features such as selective expiry by newsgroup. Selective expiry caused a small loss in performance by requiring _e_x_p_i_r_e to check the newsgroup(s) of each article against an expiry-control list. Here, _e_x_p_i_r_e benefitted from the work done to speed up the newsgroup-matching primitives of _r_n_e_w_s, since _e_x_p_i_r_e uses the same routines. _I_f _y_o_u _r_e-_i_n_v_e_n_t _t_h_e _s_q_u_a_r_e _w_h_e_e_l, _y_o_u _w_i_l_l _n_o_t _b_e_n_e_f_i_t _w_h_e_n _s_o_m_e_b_o_d_y _e_l_s_e _r_o_u_n_d_s _o_f_f _t_h_e _c_o_r_n_e_r_s|=. One improvement that was made late in development was in the format of the dates stored in the history file. B _r_n_e_w_s stored the arrival date in human-readable form, and _e_x_p_i_r_e converted this into numeric form for comparisons of dates. Date conversion is a complex operation, and the widely-distributed _g_e_t_d_a_t_e function used by news is not fast. Inspection of the code established that _e_x_p_i_r_e was the only program that ever looked at the dates in the his- tory file. There is some potential use of the information for debugging, but this is infrequent, and a small program that converts decimal numeric dates to human-readable ones addresses the issue. Both C _r_n_e_w_s and C _e_x_p_i_r_e now store the dates in decimal numeric form. _S_t_o_r_e _r_e_p_e_a_t_e_d_l_y-_u_s_e_d _i_n_f_o_r_m_a_t_i_o_n _i_n _a _f_o_r_m _t_h_a_t _a_v_o_i_d_s _e_x_p_e_n_s_i_v_e _c_o_n_v_e_r_s_i_o_n_s. Actually, C _e_x_p_i_r_e bows to compatibility by accepting either form on input, but outputs only the decimal form as it regenerates the history file. Thus, in the worst case, _e_x_p_i_r_e does the conversion only once for each history line, rather than once per line per run. ``_I_f _t_h_e_y _h_a_n_d _y_o_u _a _l_e_m_o_n, _m_a_k_e _l_e_m_o_n_a_d_e''. _________________________ it quickly. |= A corollary of this is: _k_n_o_w _t_h_y _l_i_b_r_a_r_i_e_s, _a_n_d _u_s_e _t_h_e_m. April 18, 1990 - 10 - If _e_x_p_i_r_e is archiving expired articles, it may need to create directories to hold them. This is an inherently expensive operation, but it is infrequently needed. How- ever, checking to see whether it _i_s in fact needed is also somewhat expensive... and the answer is almost always `no'. The same is true of checking to see whether the original article really still exists: it almost always does. (This cannot be subsumed under generic `archiving failed' error handling because a missing original is just an article that was cancelled, and does not call for a trouble report.) Accordingly, C _e_x_p_i_r_e just charges ahead and attempts to do the copying. Only if this fails does _e_x_p_i_r_e analyze the situation in detail. _C_a_r_r_y_i_n_g _a _n_e_t _i_n _f_r_o_n_t _o_f _y_o_u _i_n _c_a_s_e _y_o_u _t_r_i_p _i_s _u_s_u_a_l_l_y _w_a_s_t_e_d _e_f_f_o_r_t. Archiving expired articles often requires copying across filesystem boundaries, since it's not uncommon to give current news and archived news rather different treat- ment for space allocation and backups. Copying from one filesystem to another can involve major disk head movement if the two filesystems are on the same spindle. Since head movement is expensive, maximizing performance requires get- ting as much use as possible out of each movement*. _E_x_p_i_r_e is not a large program, and even on a small machine it can spare the space for a large copying buffer. So it does its archiving copy operations using an 8KB buffer. _B_u_y_i_n_g _i_n _b_u_l_k _i_s _o_f_t_e_n _c_h_e_a_p_e_r. Since 8KB accommodates most news articles in one gulp, there is little point in enlarging it further. _T_h_e _l_a_w _o_f _d_i_m_i_n_i_s_h_i_n_g _r_e_t_u_r_n_s _d_o_e_s _a_p_p_l_y _t_o _b_u_y_- _i_n_g _i_n _b_u_l_k. Since _e_x_p_i_r_e is operating on the history file at (potentially) the same time that _r_n_e_w_s is adding more arti- cles to it, some form of locking is necessary. Given that _e_x_p_i_r_e has to look over the whole database of news, and typ- ically has to expire a modest fraction of the articles, it is a relatively long-running process compared to _r_n_e_w_s. Contention for the history-file lock can be minimized by noting that _r_n_e_w_s never does anything other than append to the file. So _e_x_p_i_r_e can leave the file unlocked while scan- ning it; the contents will not change. When (and only when) _e_x_p_i_r_e reaches end-of-file, it locks the news system, checks for and handles any further entries arriving on the end of the history file meanwhile, and finishes up. _L_o_c_k_i_n_g _d_a_t_a _t_h_a_t _w_o_n'_t _c_h_a_n_g_e _i_s _w_a_s_t_e_f_u_l. After careful application of these various improve- ments, C _e_x_p_i_r_e is fast enough that further speedup is not worth much effort. However, an analysis of where it spends _________________________ * As witness the progressive increase in filesystem block size that produced major performance improvements in successive versions of 4BSD. April 18, 1990 - 11 - its time does suggest one area that might merit attention in the future. _E_x_p_i_r_e rebuilds the history file to reflect the removal of expired articles. The history file is large. _E_x_p_i_r_e must also rebuild the _d_b_m indexing data base, since it contains offsets into the history file. This data base is comparable in size to the history file itself, and is generated in a less orderly manner that requires more disk accesses. Much of the time needed for these operations could be eliminated if _e_x_p_i_r_e could mark a history line as `expired' without changing its size. This could be done by writing into the history file rather than by rebuilding the whole file, and the indexing database would not need alteration. This would also permit retaining information about an arti- cle after the article itself expires, which would simplify rejecting articles that arrive again (due to loops in the network, etc.) after the original has expired. The history file should still be cleaned out, and the indexing database rebuilt, occasionally. C _e_x_p_i_r_e contains some preliminary `hooks' for this approach, but to date full implementation does not seem justified: C _e_x_p_i_r_e is already fast enough. _K_n_o_w _w_h_e_n _y_o_u _a_r_e _f_i_n_i_s_h_e_d. _4. _B_a_t_c_h_e_r _P_e_r_f_o_r_m_a_n_c_e The C batcher is descended from a very old version written to add some minor functionality that was not present in the B batcher of the time. It is small and straightfor- ward, and contains only a couple of noteworthy performance hacks. The batcher works from a list of filed articles, to be composed into batches. The list is by absolute pathname. All of these files reside in the same area of the system's directory tree, and referring to them with absolute path- names every time implies repeatedly traversing the same ini- tial pathname prefix. To avoid this, the batcher initially _c_h_d_i_rs to a likely-looking place such as /_u_s_r/_s_p_o_o_l/_n_e_w_s. Thereafter, before using an absolute pathname to open an article, it checks whether the beginning of the pathname is identical to the directory where it already resides. If so, it strips this prefix off the name before proceeding. _I_f _y_o_u _w_a_l_k _t_h_e _s_a_m_e _r_o_a_d _r_e_p_e_a_t_e_d_l_y, _c_o_n_s_i_d_e_r _m_o_v_i_n_g _t_o _t_h_e _o_t_h_e_r _e_n_d. The batcher's input is usually in fairly random order, with little tendency for successive files to be in the same directory. If this were not the case, it would be worthwhile for the batcher to actually move around in the directory tree to be closer to the next file. The batcher used to copy data using _p_u_t_c(_g_e_t_c()) loops. This has been replaced by _f_r_e_a_d/_f_w_r_i_t_e which is April 18, 1990 - 12 - significantly faster, especially if using the souped-up _f_r_e_a_d/_f_w_r_i_t_e mentioned earlier. _I_f _y_o_u _n_e_e_d _t_o _m_o_v_e _a _m_o_u_n_- _t_a_i_n, _u_s_e _a _b_u_l_l_d_o_z_e_r, _n_o_t _a _t_e_a_s_p_o_o_n. _5. _F_u_t_u_r_e _D_i_r_e_c_t_i_o_n_s The one improvement we are still considering for _r_n_e_w_s is a radical revision of the newsgroup-matching strategy. Newsgroup matching still consumes about 18% of user-mode processor time. The key observation is that the information that determines which newsgroups go to which sites seldom changes. It would probably be worth precompiling a bit array indexed by newsgroup and site, and recompiling it only when the _a_c_t_i_v_e file or the _s_y_s file changes in a relevant way. This would cut the newsgroup-matching time to essen- tially zero. _R_n_e_w_s would be faster (and simpler) if `Newsgroups:' and `Control:' were required to be the first two headers (if present) of each article. At present _r_n_e_w_s tries to find them before starting to write the article out, so that it can put the article in the right place from the start, but it has to allow for the possibility that vast volumes of other headers may precede them. Hashing _a_c_t_i_v_e-file lookups in _r_n_e_w_s would be fun, but profiling suggests that it's not worthwhile unless the number of newsgroups is in the thousands. When PDP-11's are truly dead on Usenet, the use of large per-process memories _m_a_y allow further speedups to _r_n_e_w_s by reading the entire batch into memory at once and writing each article to disk in a few _w_r_i_t_e_s (it can't easily be reduced to a single _w_r_i_t_e because headers must be modified before filing). One optimization we have _n_o_t considered is re-coding key parts in assembler. C news already runs on five dif- ferent types of machine. Use of assembler would be a maintenance nightmare, and probably would not yield benefits comparable to those of the more high-level changes. _6. _A_c_k_n_o_w_l_e_d_g_e_m_e_n_t_s Ian Darwin ran the very earliest alpha versions of _r_n_e_w_s and gave helpful feedback. Mike Ghesquiere, Dennis Ferguson and others have run later versions and prodded Col- lyer to fix or implement assorted things. John Gilmore and Laura Creighton read and criticized an early alpha version of _r_n_e_w_s. _7. _A_p_p_e_n_d_i_x: _r_n_e_w_s _T_i_m_e_s Measurements have been taken on a VAX 750 running April 18, 1990 - 13 - 4.2BSD under generally light load, using a batch of 297,054 bytes of net.unix-wizards containing 171 articles and ~104 cross-postings. All times are in seconds per article. time real user sys comments _____________________________________________________________________ 85 Dec 6 00:54 4.68 0.3 1.29 BBBB nnnneeeewwwwssss rejecting all. (b.1.rej) 85 Dec 6 00:54 3.184 0.69 0.67 first timing trial; _p_r_o_f_i_l_i_n_g _o_n (c.1) 85 Dec 6 00:54 0.66 0.175 0.199 rejecting all (c.2.rej) 85 Dec 6 03:25 0.58 0.175 0.175 still rejecting all (c.3.rej) 85 Dec 6 23:46 9.058 0.631 2.251 BBBB nnnneeeewwwwssss using private directories, rejected 53 of the 171 articles as "too old" (b.2) 85 Dec 7 00:24 - - 2.0 (est) on a 10 MHz 68000 with slow memory and slow disk (crude timings) (c.darwin.1) 85 Dec 7 00:40 7.576 0.684 2.403 BBBB nnnneeeewwwwssss without the "too old" reject code and having cleared out history (b.3) 85 Dec 7 04:43 1.99 0.49 0.53 accepting the articles, using read and write for bulk copies (c.4) 85 Dec 7 06:10 0.497 0.449 2.261 (!) optimized by less locking & keeping batch files open (c.5) 85 Dec 7 07:32 1.383 0.491 0.414 same as the last one, but with a lower load average (around 1.5) (c.6) 85 Dec 16 03:43 1.380 0.447 0.374 for calibration after misc. cleanup (c.7, c.8) 86 Jan 13 00:23 1.232 0.349 0.301 turned hostchar() into a macro (c.9) 86 Jan 13 04:26 1.36 0.333 0.242 (!) using in-core active file, under heavy load (c.10) 86 Jan 13 08:24 1.94 0.349 0.253 using in-core sys file too, under heavy load. Re-run this trial! (c.11) 86 Jan 13 08:42 0.332 0.245 0.892 (!) re-run at better nice. Not strik- ing, except for real time. Was run in a large directory; ignore. (c.12) 86 Jan 13 08:59 0.333 0.861 (!) 0.212 (!) re-run at good nice & in a small directory. Have beaten B news by _o_n_e _o_r_d_e_r _o_f _m_a_g_n_i_t_u_d_e on real & sys times! Beat it by more than twice on user time. (c.13) 86 Jan 21 19:15 1.208 0.349 0.245 creat 1st link under final name, only link to make cross-postings; with HDRMEMSIZ too small (c.14) 86 Jan 21 19:57 0.728 0.318 0.193 previous mod, with HDRMEMSIZ of 4096 (c.15) April 18, 1990 - 14 - 86 Jan 22 01:20 0.719 0.315 0.166 fewer opens (just rewind the spool file), but Xref(s): not working (c.16) 86 Jan 22 01:53 0.314 0.637 (!) 0.154 (!) fewer opens fixed to spell Xref: right; Xref: not working (c.17) 86 Jan 22 04:00 0.874 0.325 0.174 fewer opens with Xref: fixed (times may be high due to calendar) (c.18) 86 Jan 22 05:45 0.694 0.309 0.159 under lighter load, times are better (c.19) 86 Jan 24 04:29 0.715 0.317 0.129 (!) turn creat & open into just creat, under slightly heavy load (c.20) 86 Jan 24 06:06 0.628 (!) 0.288 (!) 0.129 (!) reduce number of calls on index (by noting line starts at the start) and strncmp (via macro) in active.mem.c, but still profiling and writing stdout and stderr to the tty (c.21) 86 Jan 24 07:22 0.653 0.209 (!) 0.123 (!) fewer strlen calls (by using ssssiiiizzzzeeeeooooffff s - 1), writing stdout to /dev/null and with _p_r_o_f_i_l_i_n_g _o_f_f, but under moderate load; try again (c.23) 86 Jan 24 07:35 0.574 (!) 0.216 (!) 0.123 (!) as last time, but stdout to tty(!) and under light load. _r_u_n_n_i_n_g _1_5._6_7 _t_i_m_e_s _a_s _f_a_s_t _a_s _B _r_n_e_w_s (c.24) 86 Aug 8 04:23 0.839 0.51 0.124 performance hit: fflush after each history line for crash-resilience; run for _g_p_r_o_f output and calibra- tion with later runs. running under 4.2.1BSD (has 4.3 namei cache) now. real and user times are way up; due to gprof profiling? (c.25) 86 Aug 8 04:24 0.962 0.131 0.438 (!) run with faster ngmatch, with rrrreeeeggggiiiisssstttteeeerrrr decl.s and wordmatch and STREQN macros; saved 15% user. User time is better than c.25, but still up from c.24. (c.26) 86 Aug 10 07:35 0.805 0.135 0.345 (!) further speedups: ngmatch has more rrrreeeeggggiiiisssstttteeeerrrr decl.s and in-line index; more use of STREQ(N) macro for str(n)cmp in hdrmatch, ngmatch.c and transmit.c; faster ishdr without index. real & user times are better than both c.26 and c.25 (c.27) 86 Aug 11 04:19 1.012 0.146 0.303 (!) rewrote sys.c, used INDEX and STREQ(N) macros throughout rnews. real and sys times are up, but user continues to decline. (c.28) 86 Aug 12 03:51 1.315 0.315 0.154 minor tweaks: all.all.ctl caching, etc. (c.29) April 18, 1990 - 15 - 86 Aug 30 17:56 0.564 0.189 (!) 0.112 (!) light load, thought we had 3-arg open in fileart, but didn't. Odd. _S_t_o_p_p_e_d _u_s_i_n_g _g_p_r_o_f. (c.30) 86 Aug 30 17:57 0.191 0.475 (!) 0.095 (!) Really and truly use the 3-arg open. _1_9 _t_i_m_e_s _B _r_n_e_w_s _s_p_e_e_d. (c.31) April 18, 1990