Introduction to HTML

This document gives an introduction to HTML, the Hypertext Markup Language, including its purpose and structure, and some details about elements and attributes.

Purpose

HTML is the markup language used to describe Web pages. Most pages on the Web are written in HTML. HTML is not a programming language, although an HTML document may contain pieces of program code.

A markup language is a digital format that adds marks to a document. These marks show the document structure and indicate the role of each of the document's parts. Adding such marks to a document is called marking up the document. There are many markup languages, for many different kinds of documents, but HTML is by far the most widely known and used.

As the name Hypertext Markup Language indicates, hypertext is an important part of HTML. Hypertext means that an HTML document can link to other Web documents, and can also include other documents, in particular images.

HTML only gives the structure of a document and the role of each of its parts. It does not say anything about a document's presentation or style. So the color of the text and the background, the size of the text, the space between lines and paragraphs, the left and right margin, and many other details cannot be specified with HTML alone. This is what CSS is for. This is why a simple HTML document, without CSS, will look quite boring.

An HTML document can be viewed in two different ways: as plain text source or in a specialized viewer. The specialized viewers used for HTML are called browsers. You are probably using a browser every day to view Web pages written in HTML. You can view and edit a Web page as plain text source by opening the file with a plain text editor. In most browsers, you can also view the HTML source of a Web page by pressing Ctrl-U. The source of actual Web pages these days may be hard to read, because it contains a lot of additional detail, but it is still HTML.

Overall Structure
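
The rest of this document refers to a small example page. A minimal sketch of such a page is given here, assuming only the parts the text itself describes. The title text and the wording of the paragraph are placeholders chosen for illustration.

```html
<!DOCTYPE html>
<html lang="en-US">
  <head>
    <!-- character encoding of the document -->
    <meta charset="UTF-8">
    <!-- title text is an assumption; any descriptive title works -->
    <title>Simple Example of Web Page</title>
  </head>
  <body>
    <h1>Simple Example of Web Page</h1>
    <!-- paragraph wording is a placeholder -->
    <p>This is a <strong>very simple</strong> example
       of a Web page.</p>
  </body>
</html>
```

Saving this text in a file with the extension .html and opening it in a browser should show a heading followed by one paragraph.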

As the example shows, an HTML document starts with a document type declaration on the first line. The example uses the simplest document type declaration, <!DOCTYPE html>, which is used in modern versions of HTML. The document type declaration is followed by the html element, which contains a head part and a body part.

The head part contains information about the document. The meta element indicates the character encoding of the document. It is best to always use UTF-8, but if you use another character encoding, make sure that you adjust this information. Every HTML document has to have a title element. The content of the title element is displayed in the title bar or tab of the browser. Always make sure you have a good title; otherwise, your Web page can look embarrassing.

The body part contains the actual Web page itself. It can contain headings of different levels. The h1 element indicates a level 1 (top level) heading. There are also h2, h3, h4, h5, and h6 elements, for lower-level headings.

The p element marks up a paragraph. An average HTML document has many paragraphs. The strong element indicates that its content is strongly emphasized.

Basic Syntax

The markup in an HTML document consists of elements. Most elements have a start tag, contents, and an end tag. For the h1 element, <h1> is the start tag, </h1> is the end tag, and "Simple Example of Web Page" is the contents. It is very important to distinguish between elements and tags: an element is the whole unit, including its contents, while a tag only marks where an element starts or ends.

Elements can contain other elements. In our example, the html element contains a head element and a body element, the head element contains a title element, and so on. Elements can also contain text (such as the title, h1, and strong elements above), or a mixture of text and other elements (such as the p element).

Elements always have to be properly nested. This means that any inner element has to end with an end tag before the outer element can be closed. This is very similar to nesting of functions, loops, and so on in a programming language. It is usual to show this nesting by indenting for block-level elements (e.g. h1, p), but not for inline elements (e.g. strong).
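
The nesting rule can be illustrated with a short sketch. The sentences are placeholders; only the order of the tags matters.

```html
<!-- Properly nested: strong is closed before p is closed -->
<p>This text is <strong>important</strong>.</p>

<!-- NOT properly nested: p is closed while strong is still open -->
<p>This text is <strong>important.</p></strong>
```

Browsers usually try to repair the second form, but the result may differ from what was intended, so it is safer to always close inner elements first.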

Between block-level or higher-level tags, white space is ignored. So we could have started the document as <html><head><title>…. But this would be very hard to read. Within inline content, any run of white space is collapsed to a single space, so we can put a line break anywhere in a paragraph. White space here means half-width (ASCII) spaces, tabs, and line breaks, but not full-width spaces. For Japanese, more care may be needed, because a line break in the middle of Japanese text may be displayed as a spurious space.
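
As a small sketch of white space collapsing, the two paragraphs below should be displayed identically in a browser with default styling. The text is a placeholder.

```html
<!-- Runs of spaces and line breaks inside a paragraph
     collapse to a single space each -->
<p>These   words
   are     separated
   by single spaces.</p>

<!-- Same paragraph, written on one line -->
<p>These words are separated by single spaces.</p>
```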

Elements can have attributes that provide additional information. In our example, the html element has a lang attribute. Attributes appear inside the start tag. They consist of an attribute name (e.g. lang) and an attribute value (e.g. en-US), separated by an = sign. Here, the lang attribute indicates the natural language of the document, US English (en-US). Attribute values should be enclosed in single or double quotes. There are generic attributes such as class, dir (direction), id, lang (language), style, and title, which can appear on any element. Other attributes can only appear on specific elements, such as the href attribute on the a element.
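
The attribute syntax described above can be sketched as follows. The attribute values and the URL https://example.com/ are placeholders, not taken from any real page.

```html
<!-- id, class, lang, and title are generic attributes -->
<p id="intro" class="note" lang="en-US" title="A short note">
  <!-- href is specific to the a element; single quotes also work -->
  See the <a href='https://example.com/'>example site</a>.
</p>
```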

Characters in attribute values and textual content can be written directly, or they can be written using numeric character references. As an example, we can write ㋿, the single-square character for the new Japanese era 令和, as &#x32FF;. Here, &#x and ; are part of the numeric character reference syntax, and 32FF is the hexadecimal number of the new character, named SQUARE ERA NAME REIWA. Such a notation is useful as long as the character cannot be typed or displayed directly, for example because the editor's font does not yet support it. As a small exercise, try to write your name with numeric character references.
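
As a brief sketch, the three paragraphs below should all display the same character. The decimal form &#13055; is not mentioned in the text above; it is added here as an equivalent notation (13055 is 32FF written in decimal). The name in the last paragraph is a hypothetical example for the exercise.

```html
<p>Written directly: ㋿</p>
<p>Hexadecimal reference: &#x32FF;</p>
<p>Decimal reference: &#13055;</p>

<!-- "Ann" written with numeric character references
     (A is 41, n is 6E in hexadecimal) -->
<p>&#x41;&#x6E;&#x6E;</p>
```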

Because <, >, ', ", and & have special meanings in HTML documents, they have to be escaped when they appear as ordinary characters in attribute values or text. We can use numeric character references, or the specially defined character entities &lt;, &gt;, &apos;, &quot;, and &amp;.
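
A short sketch of escaping in text and in attribute values follows. The sentences and title values are placeholders.

```html
<!-- In text: escape < and > to show a tag literally, and & itself -->
<p>Write &lt;p&gt; to start a paragraph. AT&amp;T is a company name.</p>

<!-- In a double-quoted attribute value: escape the double quote -->
<p title="She said &quot;hello&quot;">Hover over this paragraph.</p>

<!-- In a single-quoted attribute value: escape the single quote -->
<p title='It&apos;s a tooltip'>Hover over this one, too.</p>
```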

Block-Level Elements

Block-level elements are elements similar to paragraphs. Here is a table of the most important ones.

Element Name	Start Tag	End Tag	Meaning/Usage	Additional Information
`p`	`<p>`	`</p>`	paragraph
`address`	`<address>`	`</address>`	contact information
`aside`	`<aside>`	`</aside>`	content aside from the main flow of the document
`blockquote`	`<blockquote>`	`</blockquote>`	quotation in block form
`nav`	`<nav>`	`</nav>`	navigation (links)
`header`	`<header>`	`</header>`	header of page or section	also `footer`
`section`	`<section>`	`</section>`	section, should start with a heading
`pre`	`<pre>`	`</pre>`	preformatted text	e.g. source code
`ol`	`<ol>`	`</ol>`	ordered (numbered) list	also `ul`, `li`, `dl`, `dd`, `dt`
`div`	`<div>`	`</div>`	generic document division	use whenever there is nothing more appropriate
`table`	`<table>`	`</table>`	table	also `tbody`, `tr`, `th`, `td`,...
`form`	`<form>`	`</form>`	form for data input

For more block-level elements, see the Mozilla developer page.
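
A few of the block-level elements from the table can be combined as in the following sketch. The heading and all text are placeholders chosen for illustration.

```html
<section>
  <!-- a section should start with a heading -->
  <h2>Shopping</h2>
  <p>Things to buy:</p>
  <ol>
    <li>Bread</li>
    <li>Milk</li>
  </ol>
  <blockquote>
    <p>A quotation in block form.</p>
  </blockquote>
  <!-- inside pre, spaces and line breaks are kept as written -->
  <pre>line 1
    line 2, indented</pre>
</section>
```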

Inline Elements

There are many inline HTML elements. Here we list the most important ones:

a: anchor, used for links, with the href attribute
code: inline source code and similar text
em: emphasis
img: image, with the src (source) and alt attributes (the alt attribute is important for accessibility)
ruby: ruby annotations (furigana), also rp and rt
span: generic inline element, use if some markup is needed, but there is no better element
strong: strong emphasis
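
The inline elements listed above might be used inside a paragraph as in this sketch. The URL, the file name logo.png, the class name, and all text are hypothetical placeholders.

```html
<p>
  Visit the <a href="https://example.com/">example site</a>,
  run <code>ls</code>, and note that this is <em>emphasized</em>
  and this is <strong>strongly emphasized</strong>.
  <!-- img has no end tag; alt describes the image -->
  <img src="logo.png" alt="Company logo">
  <!-- rt holds the reading; rp holds fallback parentheses -->
  <ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>
  <span class="highlight">generic inline text</span>
</p>
```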

For more inline elements, see the Mozilla developer page. There are some inline elements that should be avoided, because they are presentational. Examples are big, b (bold), i (italic), small, and u (underline).
