Internet Draft                                         John Pritchard
<draft-pritchard-http-links-00>           Columbia U Computer Science
Expires June 1996                                    21 November 1996
Efficient HyperLink Maintenance for HTTP
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working documents of
the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to cite them
other than as ``work in progress.''
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).
Distribution of this document is unlimited. Please send comments to John
Pritchard at <jdp@cs.columbia.edu>
Abstract
Hyperlink maintenance allows robots and servers to cooperate in propagating
the effects of daily changes in the millions of resource locations in the
wwweb. Here we propose to develop the definitions of the LINK and UNLINK
methods, which have been defined for HTTP since RFC 1945 but remain largely
unimplemented and unused. We believe that the only reason these methods have
not been employed is that they remain too loosely defined and, by
implication, too inefficient. A new syntax and semantics simplify
implementation and improve utility.
Author's address
John Pritchard
315 W 82nd Street, #4
New York, NY 10024
<jdp@cs.columbia.edu>
Contents
1. Introduction
2. Link Terminology
3. Implementation Terminology
4. Current HTTP Link Management Protocol
5. Some linking practices
6. Proposed Facility
7. Methods
1. LINK
2. UNLINK
3. UNLINKR
4. LINKMOD
8. Implementation
9. Idempotency
10. Security Considerations
11. Syntax
12. References
1. Introduction
The HTTP protocol has recognized the importance of link management
since HTTP/1.0 RFC 1945 [1]. However, the methods defined in HTTP/1.0
are limited and remain largely unimplemented. The existing link concept
is defined irrespective of direction, i.e., reference or resource, and so
leaves too much semantically implied. The revised methods define simple
and efficient syntax and semantics for a complete hyperlink management
protocol within HTTP.
Dangling links are a growing problem on a large and expanding wwweb.
Messages like the following are common:
The URL which you entered, ... , was not found on this server.
You may have entered it incorrectly, or it may no longer exist.
If you arrived here by clicking on a link in another page,
please tell that page's owner/administrator that the link no
longer exists.
This one resulted from a URL stored in a popular search engine. A
solution is readily available in defining HTTP's LINK and UNLINK
methods with syntax and semantics that effectively and efficiently
provide for hyperlink maintenance.
Hyperlink maintenance implies communication, processing and storage
costs. The proposed methods cut processing through their syntax: they
define no semantics that imply searching on the part of call receivers.
The proposed methods' semantics also match storage requirements to the
HTML LINK tag concept. Implementation requires no storage overhead on
robots.
The protocol detailed here is currently being implemented in an
HTTP/1.1 compliant, commercial wwweb server and agent platform under
the extensions provisions of that specification. This protocol has been
realized as the result of that effort.
2. Link Terminology
In this context we refer exclusively to links that are Uniform Resource
Locators, see URL [4] and [2]. URLs are Uniform Resource Identifiers,
URIs [1], pointing to particular resources without variation per user
identity, class or input, or other particularly perishable or localized
circumstances.
A link has two end points, one in an HTML anchor or otherwise a URL
reference, and the other in the HTTP service providing access to a
resource via a reference. The source end of a link is the client or
anchor end, sometimes the tail, and the target end of a link is the
resource end, sometimes the head.
source: anchor, reference, tail
target: resource, head, server, named anchor
Usage of source and target includes direct reference to documents, or
reference locators (URLs), or the services (hosts) at the respective
ends of a link.
For discussing efficiency, we describe a shorter URI as coarser, and a
longer one finer. The comparison could be made for URIs into the same
sub-wwweb, for example
http://www.target.com/some/long/path/ A
http://www.target.com/some/path/ B
B is coarser than A. If a coarser URI replaces a finer one, the
implication of clobbered namespaces arises as well as a greater
potential need for link modifications. Remember that handling URLs, or
particular resource locators, implies that for each link there's an
unlink.
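The coarser/finer comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "shorter" is measured here in path segments, URIs are comparable only when scheme and host match, and the function name is_coarser is hypothetical rather than part of this proposal.

```python
from urllib.parse import urlsplit


def is_coarser(b: str, a: str) -> bool:
    """Return True if URI b is coarser (has fewer path segments) than URI a.

    Assumption: only URIs into the same sub-wwweb (same scheme and host)
    are comparable; anything else is reported as not coarser.
    """
    ub, ua = urlsplit(b), urlsplit(a)
    if (ub.scheme, ub.netloc.lower()) != (ua.scheme, ua.netloc.lower()):
        return False
    # Count the non-empty path segments of each URI.
    seg_b = [s for s in ub.path.split("/") if s]
    seg_a = [s for s in ua.path.split("/") if s]
    return len(seg_b) < len(seg_a)
```

Applied to the two example URIs above, is_coarser(B, A) holds and is_coarser(A, B) does not.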
3. Implementation Terminology
In agreement with the HTTP specification documents [1] and RFC 1123, we
employ must, shall or required to indicate implementation syntax or
semantics that are not optional for software conforming to this
specification, should for recommended features and may for optional
features.
Please note that this draft does not constitute a modification of any
standard, RFC, or draft document but a proposal for review by the HTTP
Working Group and the Internet administration and development
community.
4. Current HTTP Link Management Protocol
The LINK and UNLINK methods are described in HTTP/1.1 [2] draft seven,
sections 19.6.1.2 and 19.6.1.3, respectively. In short, the link and
unlink request lines include method names and a request URI.
The specification [2] states (section 5.3)
The LINK method establishes one or more Link relationships
between the existing resource identified by the Request-URI
and other existing resources.
The UNLINK method removes one or more Link relationships
from the existing resource identified by the Request-URI.
These relationships may have been established using the
LINK method or by any other method supporting the Link
header. The removal of a link to a resource does not imply
that the resource ceases to exist or becomes inaccessible
for future references.
Because the current methods do not carry both the source and the target
of a link for LINKing or UNLINKing, implementing them implies looking
up the other end of the link. Supplying the link source or unlink
target in request headers, or on the request line, allows a valuable
optimization -- eliminating excess searching or indexing.
5. Some linking practices
Hyperlink maintenance methods are required for wwweb organization and
must be interoperable across wwweb servers and robots in order to be
effective. Robots and wanderers maintain catalogs of URI references and
hypertext. Currently, unlink maintenance of these catalogs is largely
manual. For the Robot Exclusion Standard (RES), or "/robots.txt" [6], a
new facility is under consideration for informing robots of changes to
a server's sub-web, but it doesn't address the server to server case
that most links fall into. The passive existence of a link directive
instrument on a server would require every server to get the linking
directives from every other server and apply them heuristically to try
to weed out broken links. This is untenable for broad use because of
its communication and processing requirements and the complexity of
implementation. RES is useful for directing searches on subwebs by
robots and is fairly widely employed by search engines and other
robots.
The URN [7] proposal is another idea that is sometimes mentioned but
really isn't relevant. It creates a hierarchical global namespace for
resources, and is designed for resources with extensive lifetimes, and
not the ordinary class of information. Named linking would be extremely
useful for putting hyperlinks into this document for reference
material. With a particular URN namespace, the reader would potentially
find the closest copy, perhaps a local copy of an RFC or Internet-Draft
document, rather than simply use the link provided to the USA East
Coast repository provided here. But even URN may not be appropriate for
drafts with six month lifetimes.
WWWeb meta information and versioning are important in this context as
the proposed link maintenance extensions could benefit from mutual
implementation in a wwweb server's object management system in
conjunction with "Version management with meta-level links via
HTTP/1.1" [3]. Content level links (see "Link" content header in
HTTP/1.0 [1] and LINK entity in HTML 2.0 [8]) provide a default storage
mechanism for link maintenance information.
6. Proposed Facility
Required semantics are very limited. Implementing systems are required
only to support the LINK call and to dispose of other calls cleanly.
This simple, lightweight form doesn't require storage overhead on
robots, crawlers, etc.
The cost of employing this automation is lower than might first be
imagined as link changes with coarser effects are rarer than link
changes with finer effects. Unlinks potentially occur for each link,
without matching coarse URIs into fine URLs.
If the wwweb server maintains a table of LINKs for the target document,
it can issue UNLINKR calls to delete or revise others' information when
the location changes or is deleted. So the average cost in simple
network calls and table size is linear in the number of links. The
ratio of unlink calls generated to link calls received depends entirely
on the server site characteristics.
The table for a particular doc.html would store link source info, or
reverse links. The UNLINKR call is made to the host at the source end
of the link, with the source and target links so that it can handle the
request with minimal overhead. The LINK call is made to the host
serving the target when the reference locator is used in a link-source
document.
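A reverse-link table of this kind might look like the following Python sketch. It assumes the target-to-source direction uses the UNLINKR request line of section 7, and that the target drops its table rows once the calls have been generated. The class and method names (LinkTable, record_link, unlinkr_calls) are hypothetical, and the network transport is left out.

```python
from collections import defaultdict


class LinkTable:
    """Reverse links kept by a target server: target URL -> source URLs."""

    def __init__(self):
        self._sources = defaultdict(set)

    def record_link(self, source_url, target_url):
        # Called when a LINK request naming target_url is received.
        self._sources[target_url].add(source_url)

    def unlinkr_calls(self, target_url, replacement=None):
        """Build one UNLINKR request line per stored source.

        Assumption: the rows for target_url are removed afterwards,
        since the old location is no longer served.
        """
        lines = []
        for source_url in sorted(self._sources.pop(target_url, set())):
            parts = ["UNLINKR", source_url, target_url]
            if replacement is not None:
                parts.append(replacement)  # Repl-Target-URL
            lines.append(" ".join(parts))
        return lines
```

One row is stored per distinct link and one call is generated per row, which matches the linear cost described above.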
Although HTML [8] defines LINK entities, in practice one doesn't want
the wwweb server to download its link set with each HTML document -- if
for no other reason than minimizing general bandwidth consumption.
7. Methods
1. LINK
Linking provides for subsequent link modifications from the target
to the source. Links change at their target side, so the link
establishment between two HTTP implementing systems needs to allow
the target side to tell the source side when a link URL has
changed.
The LINKMOD option tells the target end of the link that LINKMOD
calls should be made to the source end.
The target maintains a table of source links associated with
particular resources so that if their URIs change the target can
notify the source.
LINK Source-URL Target-URL
LINK Source-URL Target-URL LINKMOD
Request
The source tells the target that a URL to the target has
been stored at the source.
Reply
The target will accept LINK calls with 200 Ok unless the
Target-URL is invalid. In this case it will respond with a
417 Invalid target URI. If the LINKMOD option is requested
but not enabled, the 207 No Linkmod reply will be generated.
2. UNLINK
UNLINK removes previous LINK information. A source tells a target
that the previous source referenced in a prior LINK call no longer
exists or has moved.
UNLINK Source-URL Target-URL
UNLINK Source-URL Target-URL Repl-Source-URL
Request
The source notifies the target that the source link has
changed. Optionally, the source may specify a replacement
source URL.
Reply
The target replies with 200 Ok unless the source has
specified invalid source or target URLs. In the case of
erroneous source or target URIs, the target replies with one
of 416 Invalid source URI or 417 Invalid target URI. The
invalid target may indicate only that UNLINKR has not been
supported by the target or source system. The invalid source
reply occurs when there is no such source link information
known to the target.
3. UNLINKR
This method allows the target to inform the source that a link has
changed. It specifies that the first argument refers to a source
link that it stores and the second argument refers to a target
link from that source. It is redundant on the semantics of the
UNLINK method if the semantics of the UNLINK method included
determining whether the recipient of the call is the source or the
target.
For UNLINK, the receiver is the target end, and with UNLINKR, the
receiver is the source end.
UNLINKR Source-URL Target-URL
UNLINKR Source-URL Target-URL Repl-Target-URL
Request
The target notifies the source that the Target-URL
referenced from location Source-URL is no longer valid. The
target optionally provides the source with a replacement
target URL.
Reply
The source replies with 200 Ok unless the target has
specified invalid target or source URLs. In the case of
erroneous target or source URIs, the source replies with one
of 417 Invalid target URI or 416 Invalid source URI. The
invalid source may indicate only that UNLINK has not been
supported by the source or target system. The invalid target
reply occurs when there is no such target link information
known to the source.
4. LINKMOD
A LINKMOD call could notify robots that a page has been updated.
This would require that LINK be extended with an optional request for
LINKMOD calls.
LINKMOD would be accepted by robots and crawlers in addition to
UNLINK. The source will react according to its need for this
information.
LINKMOD Source-URL Target-URL
Request
The target informs the source that the Target-URI has
been modified.
Reply
The source replies with 200 Ok unless the target has
specified invalid target or source URLs. In the case of
erroneous target or source URIs, the source replies with one
of 417 Invalid target URI or 416 Invalid source URI. The
invalid source may indicate only that UNLINK has not been
supported by the source or target system. The invalid target
reply occurs when there is no such target link information
known to the source.
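The target-side replies for LINK and UNLINK described in this section can be sketched in Python as follows. This is an illustration, not a definitive implementation: the class name TargetLinks and its fields are hypothetical, and the choice to record the link even when replying 207 No Linkmod is an assumption, since the text does not say either way.

```python
class TargetLinks:
    """Target-side handling of LINK and UNLINK (illustrative only)."""

    def __init__(self, resources, linkmod_enabled=False):
        self.resources = set(resources)   # URLs this server serves
        self.linkmod_enabled = linkmod_enabled
        self.links = {}                   # target URL -> set of source URLs

    def link(self, source_url, target_url, linkmod=False):
        if target_url not in self.resources:
            return 417, "Invalid target URI"
        # Assumption: the link is stored even if LINKMOD is declined.
        self.links.setdefault(target_url, set()).add(source_url)
        if linkmod and not self.linkmod_enabled:
            return 207, "No Linkmod"
        return 200, "Ok"

    def unlink(self, source_url, target_url, repl_source_url=None):
        if target_url not in self.resources:
            return 417, "Invalid target URI"
        sources = self.links.get(target_url, set())
        if source_url not in sources:
            # No such source link information is known to the target.
            return 416, "Invalid source URI"
        sources.discard(source_url)
        if repl_source_url is not None:
            sources.add(repl_source_url)  # optional replacement source
        return 200, "Ok"
```

UNLINKR and LINKMOD would be handled symmetrically on the source side, with the table keyed by Source-URL instead.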
8. Implementation
We can divide all classes of HTTP-implementing software into two
categories for specifying implementation requirements. The first is the
class of systems that maintain no link references (no HTML or URL
catalogs) in their internal data. These have no implementation
requirements.
The second is systems that maintain link references in HTML or URL
catalog data. These include wwweb servers and search engines.
The implementation must include LINK and may implement UNLINK, UNLINKR
and LINKMOD. If it is only implementing LINK, it must reply with an Ok
status code to any UNLINK, UNLINKR and LINKMOD calls it receives.
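A LINK-only implementation of the requirement above might be sketched like this in Python. The function name minimal_reply, the use of a set for storage, and the 400 reply for anything unrecognized are assumptions made for illustration. Target validation is omitted to keep the sketch short.

```python
OPTIONAL_METHODS = {"UNLINK", "UNLINKR", "LINKMOD"}


def minimal_reply(method, args, stored_links):
    """Status code a LINK-only implementation might return."""
    if method == "LINK":
        plain = len(args) == 2
        with_option = len(args) == 3 and args[2] == "LINKMOD"
        if plain or with_option:
            stored_links.add((args[0], args[1]))
            # LINKMOD is not implemented here, so the option is declined.
            return 207 if with_option else 200
        return 400
    if method in OPTIONAL_METHODS:
        # Acknowledged with Ok and otherwise ignored.
        return 200
    return 400
```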
9. Idempotency
All of these methods are idempotent. Successive identical calls have
the same effect as a single call. However, this requires that LINK be
implemented so that it does not replicate identical data. Please refer
to RFCs 1738 [4] and 1808 [5] and HTTP/1.1 [2] Section 3.2.3 "URI
Comparison" for information on determining when a LINK request should
be discarded in preserving idempotency.
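One way to keep LINK from replicating identical data is to store links under a normalized comparison key, as in this Python sketch. The normalization shown (case-insensitive scheme and host, an absent port treated as 80, an empty path treated as "/") is an assumption modeled on the usual HTTP URI comparison rules and covers http URLs only. Percent-encoding equivalence is not handled, and the names comparison_key and LinkStore are hypothetical.

```python
from urllib.parse import urlsplit


def comparison_key(url):
    """Normalize an http URL for equality testing (assumed rules)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port or 80          # absent port taken as the default
    path = parts.path or "/"         # empty path taken as "/"
    return (parts.scheme.lower(), host, port, path, parts.query)


class LinkStore:
    """Stores each (source, target) pair once, however often it is LINKed."""

    def __init__(self):
        self._links = set()

    def link(self, source_url, target_url):
        key = (comparison_key(source_url), comparison_key(target_url))
        is_new = key not in self._links
        self._links.add(key)         # adding an existing key is a no-op
        return is_new

    def __len__(self):
        return len(self._links)
```

A repeated LINK call, even with a differently spelled but equivalent URL, leaves the store unchanged.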
10. Security Considerations
The UNLINK and UNLINKR methods' calls should be manually reviewed or
automated and secured for trusted or authenticated hosts.
At least robot-level spamming would be segmented into the LINKMOD
domain until people used UNLINK <target> <target> or the variation
based on replicating pages, i.e., UNLINK <target> <copy of target>.
11. Syntax
The syntax employs an induction operator, "=" (parser), and a deduction
operator, ":" (compiler). Literals are double quoted. Alternatives
follow "|". Where noted in ";" line comments, a syntactic variable may
be defined in HTTP/1.1 [2]. Two linebreaks terminate a clause; any
amount of whitespace is equivalent to a single token separator.
    Method         = "LINK"
                   | "UNLINK"
                   | "UNLINKR"
                   | "LINKMOD"

    Request        = Link-Request-Line
                   | Unlink-Request-Line
                   | UnlinkR-Request-Line
                   | LinkMod-Request-Line
                   *( general-header )       ; HTTP/1.1 07 4.5
                   CRLF

    Link-Request-Line
                   = "LINK" Source-URL Target-URL
                   | "LINK" Source-URL Target-URL "LINKMOD"

    Unlink-Request-Line
                   = "UNLINK" Source-URL Target-URL
                   | "UNLINK" Source-URL Target-URL Repl-Source-URL

    UnlinkR-Request-Line
                   = "UNLINKR" Source-URL Target-URL
                   | "UNLINKR" Source-URL Target-URL Repl-Target-URL

    LinkMod-Request-Line
                   = "LINKMOD" Source-URL Target-URL

    Source-URL     : URL                     ; RFC 1738 Resource Locator

    Target-URL     : URL

    Repl-Target-URL
                   : URL                     ; Suggested Link Replacement

    Repl-Source-URL
                   : URL                     ; Suggested Link Replacement

    Response       = Status-Line             ; As HTTP/1.1

    Status-Code    = "200"                   ; Ok
                   | "207"                   ; No Linkmod
                   | "400"                   ; Bad Request
                   | "404"                   ; Not found
                   | "416"                   ; Invalid source URI
                   | "417"                   ; Invalid target URI
                   | "500"                   ; Internal Server Error
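The request-line productions above could be parsed as in the following Python sketch. It follows the grammar as written, with whitespace-separated tokens and no HTTP version on the line, and it does not validate the URLs themselves. The names LinkRequest and parse_request_line are hypothetical.

```python
from typing import NamedTuple, Optional

# Minimum and maximum number of arguments accepted per method.
ARITY = {"LINK": (2, 3), "UNLINK": (2, 3), "UNLINKR": (2, 3), "LINKMOD": (2, 2)}


class LinkRequest(NamedTuple):
    method: str
    source_url: str
    target_url: str
    extra: Optional[str]  # "LINKMOD" option or a replacement URL, if given


def parse_request_line(line: str) -> LinkRequest:
    """Parse one request line; raise ValueError if it does not match."""
    tokens = line.split()  # any run of whitespace separates tokens
    if not tokens or tokens[0] not in ARITY:
        raise ValueError("unknown method")
    method, args = tokens[0], tokens[1:]
    low, high = ARITY[method]
    if not low <= len(args) <= high:
        raise ValueError("wrong number of arguments for " + method)
    extra = args[2] if len(args) == 3 else None
    if method == "LINK" and extra not in (None, "LINKMOD"):
        raise ValueError("the only LINK option is LINKMOD")
    return LinkRequest(method, args[0], args[1], extra)
```

A receiver might map a ValueError raised here to the 400 Bad Request status code listed above; that mapping is a design choice of the sketch, not something the grammar prescribes.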
12. References
1. Hypertext Transfer Protocol -- HTTP/1.0
rfc1945
T. Berners-Lee, R. Fielding, H. Frystyk
May 1996
2. Hypertext Transfer Protocol -- HTTP/1.1
draft-ietf-http-v11-spec-07
R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee
August 1996
3. Version management with meta-level links via HTTP/1.1
draft-ota-http-version-00
K. Ota, K. Takahashi, K. Sekiya
November 1996
4. Uniform Resource Locators (URL)
rfc1738
T. Berners-Lee, L. Masinter, M. McCahill
December 1994
5. Relative Uniform Resource Locators
rfc1808
R. Fielding
June 1995
6. Robot Exclusion Standard
norobots.html
Martijn Koster
7. A Framework for the Assignment and Resolution of Uniform Resource
Names
draft-daigle-urnframework-00
Leslie L. Daigle
June 1996
8. Hypertext Markup Language - 2.0
draft-ietf-html-spec-06
T. Berners-Lee, D. Connolly
September 1995