This document gives an introduction to HTML, the Hypertext Markup Language, including its purpose and structure, and some details about elements and attributes.
HTML is the markup language used to describe Web pages. Most pages on the Web are written in HTML. HTML is not a programming language, although it may contain pieces of program code.
A markup language is a digital format that adds marks to a document. These marks show the document structure and indicate the role of each of the document's parts. Adding such marks to a document is called marking up the document. There are many markup languages, for many different kinds of documents, but HTML is by far the most widely known and used.
As the name Hypertext Markup Language indicates, hypertext is an important part of HTML. Hypertext means that an HTML document can link to other Web documents, and can also include other documents, in particular images.
HTML only gives the structure of a document and the role of each of its parts. It does not say anything about a document's presentation or style. So the color of the text and the background, the size of the text, the space between lines and paragraphs, the left and right margin, and many other details cannot be specified with HTML alone. This is what CSS is for. This is why a simple HTML document, without CSS, will look quite boring.
An HTML document can be viewed in two different ways: as plain text source or in a specialized viewer. The specialized viewers used for HTML are called browsers. You are probably using a browser every day to view Web pages written in HTML. You can view and edit a Web page as plain text source by opening the file with a plain text editor. You can also view the HTML source of a Web page by pressing Ctrl-U in most browsers. The source of actual Web pages these days may look confusing because it contains a lot of extra detail, but it is still HTML.
This is the source of a very simple example of an HTML document:
    <!DOCTYPE html>
    <html lang='en-US'>
      <head>
        <meta charset="utf-8">
        <title>Simple Example of Web Page</title>
      </head>
      <body>
        <h1>Simple Example of Web Page</h1>
        <p>This is a simple Web page. You can use it as a start to make
          <strong>your own</strong> Web page.</p>
      </body>
    </html>
You can put the above text into a file called example.html and have a look at it in a browser.
As we can see from the above example, an HTML document starts with a
document type declaration on the first line. Our example shows the
simplest document type declaration, which is used in modern versions of HTML.
The document type declaration is followed by a head part and a body part inside the html element.
The head part contains information about the document. The meta element indicates the character encoding of the document. It is best to always use UTF-8, but if you use another character encoding, make sure that you adjust this information.
Every HTML document has to have a title element. The contents of the title element are displayed in the title bar of the browser. Always make sure you have a good title; otherwise, your Web page can look embarrassing.
The body part contains the actual Web page itself. It can contain headings of different levels. The h1 element indicates a level 1 (top level) heading. There are also h2, h3, h4, h5, and h6 elements, for lower-level headings.
The p element marks up a paragraph. An average HTML document has many paragraphs. The strong element indicates that its content is strongly emphasized.
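As a rough illustration of how headings, paragraphs, and strong emphasis fit together, the following fragment could be placed inside the body element. The page topic and all of the text are made up for this sketch.

```html
<!-- Hypothetical body content: one top-level heading, two second-level headings -->
<h1>My Travel Notes</h1>
<p>This page collects notes from recent trips.</p>

<h2>First Trip</h2>
<p>The museum was <strong>very</strong> crowded.</p>

<h2>Second Trip</h2>
<p>This time, the weather was much better.</p>
```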
The markup in an HTML document consists of elements. Most elements have a start tag, contents, and an end tag. For the h1 element, <h1> is the start tag, </h1> is the end tag, and "Simple Example of Web Page" is the contents. It is very important to distinguish between elements and tags.
Elements can contain other elements. In our example, the html element contains a head element and a body element, the head element contains a title element, and so on. Elements can also contain text (such as the title, h1, and strong elements above), or a mixture of text and other elements (such as the p element).
Elements always have to be properly nested. This means that any inner element has to end with an end tag before the outer element can be closed. This is very similar to nesting of functions, loops, and so on in a programming language. It is usual to show this nesting by indenting for block-level elements (e.g. h1, p), but not for inline elements (e.g. strong).
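The nesting rule can be seen by comparing a correct and an incorrect fragment. This is only a small sketch with made-up text; a validator would be expected to report the second paragraph as an error.

```html
<!-- Properly nested: strong is closed before p is closed -->
<p>This is <strong>very important</strong> text.</p>

<!-- NOT properly nested: p is closed while strong is still open -->
<p>This is <strong>very important</p></strong>
```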
Between block-level or higher-level tags, white space is ignored. So we could have started the document as <html><head><title>…. But this would be hard to read. Inline white space is collapsed to a single space, so we can have a line break anywhere in a paragraph. White space means half-width spaces, tabs, and line breaks, but not full-width spaces. For Japanese, more care may be needed to avoid spurious spaces.
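To illustrate white space collapsing, the two paragraphs in the following sketch should be displayed identically, assuming the browser's default styling (CSS can change how white space is handled).

```html
<!-- Written on a single line with single spaces -->
<p>White space is collapsed to a single space.</p>

<!-- Same text with extra spaces and line breaks inside the paragraph -->
<p>White    space
   is collapsed
   to a single      space.</p>
```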
Elements can have attributes that provide additional information. In our example, the html element has a lang attribute. Attributes appear inside the start tag. They consist of an attribute name (e.g. lang) and an attribute value (e.g. en-US), separated by a = sign. Here, the lang attribute indicates the natural language of the document, US English (en-US). Attribute values should be enclosed in single or double quotes. There are generic attributes such as class, dir (direction), id, lang (language), style, and title, which can appear on any element. Other attributes can only appear on specific elements, such as the href attribute on the a element.
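The following fragment sketches both kinds of attributes. The id and class values, the sample text, and the link target (on the reserved example.org domain) are placeholders chosen for illustration only.

```html
<!-- Generic attributes: id, class, and lang can appear on any element -->
<p id="greeting" class="note">
  The Japanese word <span lang="ja">こんにちは</span> means "hello".
</p>

<!-- Element-specific attribute: href on the a element -->
<p>See the <a href="https://www.example.org/">example site</a> for details.</p>
```

Note that lang is used here on an inner element to mark a phrase whose language differs from the rest of the document.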
Characters in attribute values and textual content can be written directly, or they can be written using numeric character references. As an example, we can write the single-square character for the new Japanese era 令和 as &#x32FF;. Here, &#x and ; are part of the numeric character reference syntax, and 32FF is the hexadecimal number of the new character named SQUARE ERA NAME REIWA. Such a notation is useful when the character cannot easily be typed or displayed in your editor, for example because your fonts do not yet support it. As a small exercise, try to write your name with numeric character references.
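As a small sketch, the two paragraphs below should display the same three characters. The first uses the hexadecimal form described above. The second uses the decimal form (without the x), which is not covered in the text above but is also part of HTML. The choice of é and € as additional sample characters is arbitrary.

```html
<!-- Hexadecimal references for U+32FF, U+00E9, and U+20AC -->
<p>&#x32FF; &#xE9; &#x20AC;</p>

<!-- The same characters as decimal references -->
<p>&#13055; &#233; &#8364;</p>
```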
Because <, >, ', ", and & have special meanings in HTML documents, they have to be escaped when they appear as ordinary characters in attribute values or text. We can use numeric character references, or the specially defined character entities &lt;, &gt;, &apos;, &quot;, and &amp;.
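The fragment below sketches how escaping looks in practice, once in text content and once in an attribute value. The sample code line and the title text are made up for illustration.

```html
<!-- Displays: if (a < b && c > d) -->
<p><code>if (a &lt; b &amp;&amp; c &gt; d)</code></p>

<!-- A double quote inside a double-quoted attribute value has to be escaped -->
<p title="She said &quot;hello&quot;">Hover over this paragraph.</p>
```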
Block-level elements are elements similar to paragraphs. Here is a table of the most important ones.
Element Name | Start Tag | End Tag | Meaning/Usage | Additional Information |
---|---|---|---|---|
p | <p> | </p> | paragraph | |
address | <address> | </address> | contact information | |
aside | <aside> | </aside> | content aside from the main flow of the document | |
blockquote | <blockquote> | </blockquote> | quotation in block form | |
nav | <nav> | </nav> | navigation (links) | |
header | <header> | </header> | header of page or section | also footer |
section | <section> | </section> | section, should start with a heading | |
pre | <pre> | </pre> | preformatted text | e.g. source code |
ol | <ol> | </ol> | ordered (numbered) list | also ul, li, dl, dd, dt |
div | <div> | </div> | generic document division | use whenever there is nothing more appropriate |
table | <table> | </table> | table | also tbody, tr, th, td, ... |
form | <form> | </form> | form for data input | |
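As a rough sketch of how several of these block-level elements might be combined, consider the following fragment. The section topic, the steps, and the file name are invented for illustration.

```html
<section>
  <h2>Installation</h2>
  <p>Follow these steps:</p>
  <ol>
    <li>Download the archive.</li>
    <li>Unpack it.</li>
  </ol>
  <!-- pre keeps spaces and line breaks exactly as written -->
  <pre>tar -xzf example.tar.gz</pre>
  <blockquote>
    <p>Quoted text from another source goes here.</p>
  </blockquote>
</section>
```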
For more block-level elements, see the Mozilla developer page.
There are many inline HTML elements. Here we list the most important ones:
- a (with the href attribute)
- code
- em
- img (with the src and alt attributes; the alt attribute is important for accessibility)
- ruby (with rp and rt)
- span
- strong
For more inline elements, see the Mozilla developer page. There are some inline elements that should be avoided, because they are presentational. Examples are big, b (bold), i (italic), small, and u (underline).
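The most important inline elements listed above could be used as in the following sketch. The link target, the image file name logo.png, and all of the text are hypothetical.

```html
<p>
  Read the <a href="https://www.example.org/guide.html">guide</a>,
  which explains the <code>lang</code> attribute and why it
  <em>matters</em>.
</p>
<p>
  <!-- alt gives a text alternative for users who cannot see the image -->
  <img src="logo.png" alt="Logo of the example project">
  <!-- rt holds the reading; rp holds fallback parentheses for browsers without ruby support -->
  <ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>
</p>
```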
HTML pages should always be validated to make sure the syntax is correct. Browsers tolerate some errors. But you do not want to produce HTML pages with errors. So always validate your documents.