Basics of Web – Minimal Guide

Hosting a web site isn’t substantially different from providing any other network service. A daemon listens for connections on TCP port 80 (the HTTP standard), accepts requests for documents, and transmits them to the requesting user’s browser. Many of these documents are generated on the fly in conjunction with databases and application frameworks, but that’s incidental to the underlying HTTP protocol.

Resource locations on the web

Information on the Internet is organized into an architecture defined by the Internet Society (ISOC). This well-intended (albeit committee-minded) organization helps ensure consistency and interoperability throughout the Internet. ISOC defines three primary ways to identify a resource: Uniform Resource Identifiers (URIs), Uniform Resource Locators (URLs), and Uniform Resource Names (URNs).

So what’s the difference?

URLs tell you how to locate a resource by describing its primary access mechanism (e.g., http://example.com).
URNs identify (“name”) a resource without implying its location or telling you how to access it (e.g., urn:isbn:0-13-020601-6).
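The distinction can be seen by parsing the two example identifiers above. The following is a minimal sketch using Python's standard urllib.parse module; the function name describe is a made-up helper for illustration, and treating "anything whose scheme is urn" as a URN is a simplification.

```python
from urllib.parse import urlsplit


def describe(identifier):
    """Classify an identifier as a URN or a URL (hypothetical helper)."""
    parts = urlsplit(identifier)
    if parts.scheme == "urn":
        # A URN carries a namespace and a name, but no host to contact.
        namespace, _, name = parts.path.partition(":")
        return ("URN", namespace, name)
    # A URL names an access mechanism (the scheme) and where to go.
    return ("URL", parts.scheme, parts.netloc)


print(describe("http://example.com"))
print(describe("urn:isbn:0-13-020601-6"))
```

The URL yields a scheme and a host to connect to, while the URN yields only a namespace (isbn) and a name within it.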

When do you call something a URI? If a resource is only accessible through the Internet, refer to it as a URL. If it could be accessed through the Internet or through other means, then you’re using a URI.

Uniform resource locators

Most of the time, you’ll be dealing with URLs, which describe how to access an object through five basic components:

Protocol or application
Hostname
TCP/IP port (optional)
Directory (optional)
Filename (optional)
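As a rough illustration of these five components, the sketch below splits a URL with Python's standard library. The URL http://example.com:8080/docs/index.html and the helper name url_components are assumptions chosen for the example, not values from any real site.

```python
import posixpath
from urllib.parse import urlsplit


def url_components(url):
    """Break a URL into the five components listed above."""
    parts = urlsplit(url)
    # The path holds both the directory and the filename.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,          # None when the URL omits it
        "directory": directory,
        "filename": filename,
    }


print(url_components("http://example.com:8080/docs/index.html"))
```

When the optional parts are omitted, as in http://example.com, the port comes back as None and the directory and filename are empty strings.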

List of URL Protocols

file – Accesses a local file (file:///etc/syslog.conf)
ftp – Accesses a remote file via FTP (ftp://ftp.example.com/adduser.tar.gz)
http – Accesses a remote file via HTTP (http://example.com/index.html)
https – Accesses a remote file via HTTP over SSL/TLS (https://example.com/order.shtml)
ldap – Accesses LDAP directory services (ldap://ldap.bigfoot.com:389/cn=Herb)
mailto – Sends email to a designated address (mailto:info@yeahhub.com)

How HTTP works

HTTP is a stateless client/server protocol. A client asks the server for the “contents” of a specific URL. The server responds either with a spurt of data or with some type of error message. The client can then go on to request another object. Because HTTP is so simple, you can turn yourself into a crude web browser by running telnet. Just telnet to port 80 on your web server of choice. Once you’re connected, you can issue HTTP commands.

The most common command is GET, which requests the contents of a document. Usually, GET / is what you want, since it requests the root document (usually, the home page) of whatever server you’ve connected to. HTTP method names are case sensitive, so make sure you type commands in capital letters.

$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.example.com.
Escape character is '^]'.
GET /

<contents of your default file appear here>
Connection closed by foreign host.
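The same exchange can be sketched in a few lines of Python with a plain TCP socket. This is only an approximation of the telnet session above: the function names crude_browser and read_until_closed are hypothetical, and the sketch assumes the server closes the connection once it has sent the document.

```python
import socket


def read_until_closed(sock):
    """Collect data until the other side closes the connection."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:          # empty read means the peer closed
            return b"".join(chunks)
        chunks.append(data)


def crude_browser(server, port=80, request=b"GET /\r\n", timeout=5.0):
    """Send one request line, as typed into telnet, and return the reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(request)
        return read_until_closed(sock)
```

Calling crude_browser("localhost") against a running web server should return the bytes that telnet would have displayed. Some servers may reject a bare GET / and expect the fuller request form.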

A more “complete” HTTP request would include the HTTP protocol version, the host that the request is for (required to retrieve a file from a name-based virtual host), and other information. The response would then include informational headers as well as response data. For example:

$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.example.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Date: Sat, 01 Aug 2009 17:43:10 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Sat, 01 Aug 2009 16:20:22 GMT
Content-Length: 7044
Content-Type: text/html

<contents of your default file appear here>
Connection closed by foreign host.

In this case, we told the server we were going to speak HTTP protocol version 1.1 and named the virtual host from which we were requesting information. The server returned a status code (HTTP/1.1 200 OK), its idea of the current date and time, the name and version of the server software it was running, the date that the requested file was last modified, the length of the requested file, and the requested file’s content type. The header information is separated from the content by a single blank line.
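To make the layout of the response concrete, here is a minimal sketch that splits a raw response into its status code, headers, and content at the first blank line. The function name parse_response is an assumption for illustration, and the sketch ignores details a real client must handle, such as repeated headers and chunked transfer encoding.

```python
def parse_response(raw):
    """Split raw response bytes into (status, reason, headers, body)."""
    # The first blank line (CRLF CRLF) separates headers from content.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"
    fields = lines[0].split(" ", 2)
    status = int(fields[1])
    reason = fields[2] if len(fields) > 2 else ""

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return status, reason, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Sat, 01 Aug 2009 17:43:10 GMT\r\n"
    b"Server: Apache/2.2.3 (CentOS)\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html></html>"
)
status, reason, headers, body = parse_response(sample)
print(status, reason, headers["server"], body)
```

The sample bytes reuse header values from the session above; the body is a placeholder standing in for the default file.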

Application servers

Complex enterprise applications may need more functionality than a basic HTTP server can provide. For example, modern-day Web 2.0 pages often contain a subcomponent that is tied to a dynamic data feed (e.g., a stock ticker). Although it’s possible to implement this functionality with Apache through technologies such as AJAX and JavaScript Object Notation (JSON), some developers prefer a more fully featured language such as Java. The common way to interface Java applications to an enterprise’s other data sources is with a “servlet.” Servlets are Java programs that run on the server on top of an application server platform. Application servers can work independently or in concert with Apache.

Most application servers were designed by programmers for programmers and lack the concise debugging mechanisms expected by system administrators.

Tomcat (Open source) tomcat.apache.org
GlassFish (Open source) glassfish.dev.java.net
JBoss (Open source) jboss.org
OC4J (Commercial) oracle.com/technology/tech/java/oc4j
WebSphere (Commercial) ibm.com/websphere
WebLogic (Commercial) oracle.com/appserver/weblogic/weblogic-suite.html
Jetty (Open source) eclipse.org/jetty

Load balancing

It’s difficult to predict how many hits (requests for objects, including images) or page views (requests for HTML pages) a server can handle per unit of time. A server’s capacity depends on the system’s hardware architecture (including subsystems), the operating system it is running, the extent and emphasis of any system tuning that has been performed, and perhaps most importantly, the construction of the sites being served: do they contain only static HTML pages, or must they make database calls and numeric calculations?

Only direct benchmarking and measurement of your actual site running on your actual hardware can answer the “how many hits?” question. Sometimes, people who have built similar sites on similar hardware can give you information that is useful for planning. In no case should you believe the numbers quoted by system suppliers. Also remember that your bandwidth is a key consideration. A single machine serving static HTML files and images can easily serve enough data to saturate an OC3 (155 Mb/s) link.
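The bandwidth point can be checked with back-of-the-envelope arithmetic. The sketch below estimates how many responses per second would fill a link; the 20,000-byte average response size is purely an assumed figure for illustration, and the calculation ignores TCP/IP and HTTP header overhead.

```python
def requests_per_second_to_saturate(link_bits_per_second, avg_response_bytes):
    """Rough number of responses per second needed to fill a link."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / avg_response_bytes


OC3_BITS_PER_SECOND = 155_000_000   # 155 Mb/s, as quoted above
AVG_RESPONSE_BYTES = 20_000         # assumed average object size

rate = requests_per_second_to_saturate(OC3_BITS_PER_SECOND, AVG_RESPONSE_BYTES)
print(rate)  # 968.75
```

Under these assumptions, fewer than a thousand static responses per second would fill the link, which is why bandwidth deserves a place in capacity planning alongside server hardware.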

That said, instead of single-server hit counts, a better parameter to focus on is scalability; a web server typically becomes CPU- or IO-bound before saturating its Ethernet interface. Make sure that you and your web design team plan to spread the load of a heavily trafficked site across multiple servers.

Load balancing adds both performance and redundancy. Several different load balancing approaches are available: round robin DNS, load balancing hardware, and software-based load balancers.

Round robin DNS is the simplest and most primitive form of load balancing. In this system, multiple IP addresses are assigned to a single hostname. When a request for the web site’s IP address arrives at the name server, the client receives one of the IP addresses in response. Addresses are handed out one after another, in a repeating “round robin” sequence. Round robin load balancing is extremely common.
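The rotation can be illustrated with a toy resolver. This is a simplified simulation of the behavior described above, not how any particular name server is implemented; the class name RoundRobinResolver, the hostname, and the 192.0.2.x addresses (a range reserved for documentation) are all assumptions for the example.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy name server that hands out one address per query, in rotation."""

    def __init__(self, records):
        # One endlessly repeating iterator per hostname.
        self._rotations = {name: cycle(addrs) for name, addrs in records.items()}

    def resolve(self, hostname):
        return next(self._rotations[hostname])


resolver = RoundRobinResolver(
    {"www.example.com": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}
)
answers = [resolver.resolve("www.example.com") for _ in range(4)]
print(answers)
```

After the third query the sequence wraps around, so the fourth client is sent to the first server again. Nothing in this scheme notices when a server is down or overloaded, which is part of what makes it the most primitive approach.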
