10 Jan 1997

HTTP

The HyperText Transfer Protocol (HTTP) defines how the Browser sends requests and how the Web Server replies. It is a very simple system in which requests and the headers of replies are sent as plain character text.

The Browser generates a request when the user types in a URL or clicks on a phrase or image linked to a URL. The URL contains the protocol name ("http:"), a server name ("pclt.cis.yale.edu"), and the name of a resource ("pclt/default.htm"). Most of the time, the resource is a file, but it is up to the Server to figure out what the name means.

The HTTP rules tell the Browser to establish an Internet connection to the Server named in the URL. The same machine may provide many services, each identified by a port number. The default for a Web Server, if no other value is specified, is to use port 80.
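
As a rough illustration of the steps above, the following Python sketch splits a URL into the parts just described and applies the port 80 default. The function name split_url is hypothetical, and the standard urllib.parse module is used only for convenience; a real Browser does considerably more.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # used when the URL does not name a port


def split_url(url):
    """Return (protocol, server, port, resource) for an http URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    resource = parts.path or "/"   # an empty path means the server's root
    return parts.scheme, parts.hostname, port, resource


print(split_url("http://pclt.cis.yale.edu/pclt/default.htm"))
```

Note that the resource comes back with a leading "/", which is the form used on the first line of the request.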

The Request

The Browser then sends a request in the form of a stream of ordinary text. The first line of text contains a verb (usually "GET" or "POST"), the resource name with a leading slash ("/pclt/default.htm"), and the version of the protocol ("HTTP/1.0"). Subsequent lines contain "header information" in the form of an attribute name, a colon, and some value. The request ends with a blank line. Header lines are optional, but a typical request from a modern browser might have the form:

GET /pclt/default.htm HTTP/1.0
accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
user-agent: Mozilla/3.01Gold (WinNT; I)
connection: Keep-Alive

The "accept" header indicates that when an image is available in more than one format, the browser would prefer to receive it as a GIF or JPEG file. The "user-agent" header identifies the Browser as Netscape Navigator 3.01 Gold running under Windows NT; the "I" marks the international edition of the Browser. The "connection" header offers to obey the "keep-alive" protocol, where the network session is reused for subsequent requests. Other headers can specify the preferred language (French, Spanish, English) for documents, or provide the timestamp of a previous copy of the file that the Browser holds in its cache, suggesting that the file be resent only if a newer copy is available.

The Server receives this request and extracts the resource name (/pclt/default.htm) from the first line. The resource could be the name of a file, which the Server will read off disk and send back to the Browser. Or the resource could be a program, which the Server will run, sending the program output back to the Browser.
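
The first thing a Server has to do with the request can be sketched in a few lines of Python. This is only a minimal illustration, assuming the request arrives as one complete string with lines ending in carriage return and line feed; parse_request is a hypothetical name, not part of any real server.

```python
def parse_request(text):
    """Split raw request text into (verb, resource, version, headers)."""
    head = text.split("\r\n\r\n", 1)[0]      # everything before the blank line
    lines = head.split("\r\n")
    verb, resource, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Each header is an attribute name, a colon, and a value.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return verb, resource, version, headers


request = (
    "GET /pclt/default.htm HTTP/1.0\r\n"
    "accept: image/gif, image/jpeg, */*\r\n"
    "connection: Keep-Alive\r\n"
    "\r\n"
)
verb, resource, version, headers = parse_request(request)
print(verb, resource, version)
print(headers["connection"])
```

Header names are folded to lower case here because the examples in this section use both "accept" and "Content-type" styles.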

The Response

HTTP defines the format of the Server response to be similar to the original request. Again the first line has three fields, but now the protocol comes first instead of last. There is also a status code and status message. Again, there are optional header lines, each with an attribute name, a colon, and an attribute value. The headers end with the first blank line. This is immediately followed with the data from the file or the output from the program:

HTTP/1.0 200 OK
Server: Netscape-Communications/1.12
Last-modified: Friday, 23-Feb-96 14:54:39 GMT
Content-length: 88
Content-type: text/plain

This is a simple text file placed in the first document directory
of the PCLT server.

A response code of 200 means that the request was processed normally. Codes in the 30x range indicate that the resource has moved, or that the copy already held in the Browser's cache is still current. Codes in the 40x range indicate a problem with the request: the resource name cannot be found (404), or you are not permitted to view the resource (401, 403). Codes in the 50x series indicate an error in the Server or in a program that it ran.

The "Content-type" header provides the MIME code for a particular type of data. Here the "text/plain" code means that the file is ordinary unformatted text. The most common type is "text/html" for a formatted Web document. A Web server can use HTTP to send images, audio, compressed archives, Java classes, or any other type of data. The response begins with the status line and header lines in plain text, but after the first blank line that ends the headers, the rest of the response can be 8-bit binary data.
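
The split between text headers and possibly binary data can be sketched as follows. This is a minimal illustration that assumes the whole response is already in memory as bytes; parse_response is a hypothetical name. Only the status line and headers are decoded as text, and the body is left as raw bytes.

```python
def parse_response(raw):
    """Split raw response bytes into (version, code, message, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")   # body may be 8-bit binary
    lines = head.decode("latin-1").split("\r\n")
    version, code, message = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), message, headers, body


raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-length: 88\r\n"
    b"Content-type: text/plain\r\n"
    b"\r\n"
    b"This is a simple text file placed in the first document directory\r\n"
    b"of the PCLT server.\r\n"
)
version, code, message, headers, body = parse_response(raw)
print(code, message, headers["content-type"], len(body))
```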

If the Server provides a "Content-length" header line, then the Browser knows exactly how many bytes of data to read. Various extensions of HTTP 1.0 will use this information to keep the Internet session active so that a new request can be sent over it. If the Server is unable to provide the length of the data, then the Browser must read bytes until the Server indicates the end of the data by terminating the session (so the Browser gets an "end of file"). A new request will then require a new session.
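
The two ways of finding the end of the data might look like this in Python. It is a sketch under stated assumptions: stream is a buffered binary stream positioned just after the blank line, headers is a dictionary with lower-case names, and read_body is a hypothetical helper. An in-memory io.BytesIO stands in for the network session.

```python
import io


def read_body(stream, headers):
    """Read the response body; return (body, reusable).

    reusable is True only when the length was known in advance, so the
    session could carry another request afterwards.
    """
    length = headers.get("content-length")
    if length is not None:
        return stream.read(int(length)), True
    # No length given: read until the Server ends the session (end of file).
    return stream.read(), False


# With a length, exactly that many bytes are consumed and the rest is untouched.
stream = io.BytesIO(b"hello worldNEXT")
print(read_body(stream, {"content-length": "11"}))
print(stream.read())

# Without a length, everything up to end of file is the body.
print(read_body(io.BytesIO(b"all of it"), {}))
```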

This is important because few HTTP operations involve only a single exchange. A formatted Web document can contain HTML references to image files that provide illustrations, icons, or even background patterns. It can also have references to Java Applets, which may in turn require additional Java Class files to execute. A Microsoft Browser may contain references to an ActiveX object. After the initial HTML file is loaded, the Browser will come back to the Server to load any files required by these embedded references. If the session can remain alive and be reused, this sequence of downloads will be faster.

Running Programs

The resource named in the HTTP request may refer to a program. HTTP does not specify how the Server determines this. In some cases, there are special directories that contain only programs. It is a convention that "cgi-bin/" is a dummy directory that contains ordinary executable programs, typically written in C or Perl. The Netscape Server reserves a special dummy directory name "server-java/" for Java programs that run on the Server. In other cases, the Server decides based on file type. The Microsoft IIS server recognizes a file with a "*.DLL" extension as an ISAPI program, while a file that ends in "*.ASP" is an IIS 3.0 Active Server Page, an HTML file that may contain embedded VB Script or JavaScript programming.
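
To make the two kinds of rule concrete, here is a small Python sketch. It lumps together conventions that in practice belong to different Servers and are set through each Server's own configuration, so the tables and the function name is_program are purely illustrative assumptions.

```python
# Dummy directory names that hold only programs (from the conventions above).
PROGRAM_DIRECTORIES = ("/cgi-bin/", "/server-java/")

# File types that a Server might treat as programs (the IIS examples above).
PROGRAM_EXTENSIONS = (".dll", ".asp")


def is_program(resource):
    """Guess whether a resource name refers to a program rather than a file."""
    path = resource.lower()
    return path.startswith(PROGRAM_DIRECTORIES) or path.endswith(PROGRAM_EXTENSIONS)


print(is_program("/cgi-bin/lookup"))
print(is_program("/scripts/order.ASP"))
print(is_program("/pclt/default.htm"))
```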

No matter what type of program the Server decides to run, the only information available is going to come from the request text stream. Primarily, this means the resource name in the first request line, and the header lines that follow. There are two additional fields, however, that are frequently used to communicate information to a Server-side program.

A string of program parameters, commonly called the "query string", can be appended to the program name following a "?" character. The only constraint that HTTP places on the format of this string is that it cannot contain blanks. There are two conventions that suggest, but do not require, a possible format for it.

On many systems, the query string is limited to 255 characters. When more data must be passed to the program, an alternative is provided by the POST protocol. POST is an alternative verb, replacing GET on the first line of the request. POST indicates that the request contains additional input data. To use POST, the request must contain a header line with the same "Content-length" field specified in a typical reply. As with the reply data, the "Content-length" in the request indicates the number of bytes of (possibly binary) data that follows the first blank line ending the sequence of header lines. There can also be a "Content-type" field indicating the format of this data. Again there are two conventions that suggest, but do not require, a particular use for this data.

A program could be written to accept data in any format. One could imagine the use of POST data to carry a SQL query to a database:

POST /dbms/query.exe HTTP/1.0
content-type: text/plain
content-length: 117

SELECT name, day, start_time, length, channel
FROM tv_program_table
WHERE type="SITCOM"
ORDER by day, start_time
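
A request like this one could be assembled with a few lines of Python. The sketch below is only an illustration: build_post is a hypothetical helper, HTTP/1.0 is assumed to match the earlier examples, and each line of the data is assumed to end with a carriage return and line feed. The point is that the "content-length" value is computed from the data rather than typed by hand.

```python
def build_post(resource, body, content_type="text/plain"):
    """Return the raw bytes of an HTTP/1.0 POST request carrying body."""
    head = (
        f"POST {resource} HTTP/1.0\r\n"
        f"content-type: {content_type}\r\n"
        f"content-length: {len(body)}\r\n"   # number of bytes after the blank line
        "\r\n"
    )
    return head.encode("ascii") + body


query = (
    b"SELECT name, day, start_time, length, channel\r\n"
    b"FROM tv_program_table\r\n"
    b'WHERE type="SITCOM"\r\n'
    b"ORDER by day, start_time\r\n"
)
request = build_post("/dbms/query.exe", query)
print(request.decode("ascii"))
```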

Semi-Stateless

Any Keep-Alive convention to reuse the same session for more than one request doesn't change the basic HTTP model that every request and response stand alone, independent of previous or future exchanges. A Web page can point to many different documents and programs on many different servers. When a program runs on a Web Server, it cannot make any assumptions about how the Browser came to make the request.

Netscape proposed, and the industry has generally accepted, a tool to provide a small amount of history. It is the "cookie" protocol. A Web Server can include a "Set-cookie:" header in any HTTP response. The cookie data consists of a sequence of varname=value pairs separated by semicolons. The Browser stores this information in a disk file, and returns it as a "Cookie:" header field in any subsequent request directed to another program in the same directory of the same Web Server.
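
The round trip can be sketched in Python as follows. This is a simplified illustration, not a full implementation of the cookie proposal: it assumes that pairs named "expires", "path", and "domain" are attributes that control the cookie rather than data to be returned, it ignores expiry and domain matching, and the function names and the sample customer value are hypothetical.

```python
ATTRIBUTES = {"expires", "path", "domain"}   # assumed attribute names


def parse_set_cookie(value):
    """Split a Set-cookie value into (data pairs, attribute pairs)."""
    data, attributes = {}, {}
    for item in value.split(";"):
        name, sep, val = item.strip().partition("=")
        if not sep:
            continue                      # skip items that are not name=value
        if name.lower() in ATTRIBUTES:
            attributes[name.lower()] = val
        else:
            data[name] = val
    return data, attributes


def cookie_header(data, attributes, resource):
    """Return the Cookie: line for resource, or None if its path does not match."""
    if not resource.startswith(attributes.get("path", "/")):
        return None
    return "Cookie: " + "; ".join(f"{n}={v}" for n, v in data.items())


data, attributes = parse_set_cookie("customer=12345; path=/cgi-bin/")
print(cookie_header(data, attributes, "/cgi-bin/order"))
print(cookie_header(data, attributes, "/pclt/default.htm"))
```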

Since the cookie information is stored in a disk file, it can be viewed, altered, or deleted by the user. This is not a safe place to store sensitive information. However, it can provide some history, and it can hold information so the application doesn't have to ask the user the same silly questions over and over when he revisits the site. It can also hold a customer number or other ID field that can be used to locate information in a Server-side database, where the application can securely store information that the user cannot alter.

It's Batch-like

Corporate information systems have either been based on interactive 3270 terminal sessions, or formally structured interactive transaction processing protocols (technically known as APPC or LU6.2). In either case, a Server receiving a request could ask questions of the client to fill in missing pieces or refine the request.

In HTTP, a request or response must be complete and self-contained. The Request contains the name of the program and all the parameter data needed for the program to run.

It is also true, but somewhat less obvious, that a Response from the Server can also indicate programs to run and the parameters needed to run them. This occurs when an HTML document contains a reference to a Java Applet or defines an ActiveX object. In both cases, the HTML tags provide the ID of a program to run, a location to fetch a copy of the program if the Browser doesn't already have it, and a sequence of varname=value pairs that are passed to the program when it starts up. In this case, the program that generates the Response will generally have ended before any of the programs that it identifies start up on the Browser.

HTTP behaves like a sequence of batch jobs exchanged between two mainframes. A batch job is, after all, a text file that provides the name of a program to run on a machine, along with any parameters it needs and any input data. In this case, when the program is done it generates a new batch job which is sent back to the original machine, which in turn generates a new batch job that is sent back to the other computer, and so on.

This is completely different from the communications models for timesharing or conventional transaction processing. It requires an entirely different structure to the application. This explains why most Web application design is Object Oriented. Requests and responses can be mapped to the "events" and "messages" of an object oriented design, and since no information is automatically saved between exchanges, history data and state have to be explicitly packaged in Objects that are saved and restored using Cookies or other heuristics.
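
One way to picture state that is explicitly packaged, saved, and restored is the Python sketch below. Everything in it is an assumption made for illustration: the VisitState class, its fields, and the choice of JSON with URL-style escaping as the packaging. The escaping keeps blanks and semicolons out of the value so that it could travel as a single cookie value.

```python
import json
from dataclasses import dataclass, asdict
from urllib.parse import quote, unquote


@dataclass
class VisitState:
    """History that the application wants to carry between exchanges."""
    customer_id: str
    pages_seen: int = 0

    def to_cookie(self):
        # Package the object as text that is safe to place in a cookie value.
        return quote(json.dumps(asdict(self)))

    @classmethod
    def from_cookie(cls, value):
        # Rebuild the object from the text returned by the Browser.
        return cls(**json.loads(unquote(value)))


state = VisitState("12345", pages_seen=2)
saved = state.to_cookie()             # sent to the Browser with the response
restored = VisitState.from_cookie(saved)   # recovered from the next request
print(restored)
```

Because the user can alter the cookie file, a real application would treat the restored values as untrusted, or store only an ID and keep the rest in a Server-side database.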


Copyright 1996 PC Lube and Tune -- Distributed Applications and the Web H. Gilbert