Paragraphs, Lines, and Phrases

9 Text

The following sections discuss issues surrounding the structuring of text.
Elements that present text (alignment
elements, font elements, style sheets, etc.) are discussed elsewhere in the
specification. For information about characters, please consult the section on
the document character set.

9.1 White space

The document character set includes a wide
variety of white space characters. Many of these are typographic elements used
in some applications to produce particular visual spacing effects. In HTML,
only the following characters are defined as white space
characters:

  • ASCII space (&#x0020;)
  • ASCII tab (&#x0009;)
  • ASCII form feed (&#x000C;)
  • Zero-width space (&#x200B;)

Line breaks are also white space characters. Note
that although &#x2028; and &#x2029; are defined in [ISO10646] to
unambiguously separate lines and paragraphs, respectively, these do not
constitute line breaks in HTML, nor does this specification include them in the
more general category of white space characters.

This specification does not indicate the behavior, rendering or otherwise,
of space characters other than those explicitly identified here as white space
characters. For this reason, authors should use appropriate elements and styles
to achieve visual formatting effects that involve white space, rather than
space characters.

For all HTML elements except
PRE, sequences of white space separate “words”
(we use the term “word” here to mean “sequences of non-white space
characters”). When formatting text, user agents should identify these words and
lay them out according to the conventions of the particular written language
(script) and target medium.

This layout may involve putting space between words (called
inter-word space), but conventions for inter-word space vary
from script to script. For example, in Latin scripts, inter-word space is
typically rendered as an ASCII space (&#x0020;), while in Thai it is a
zero-width word separator (&#x200B;). In Japanese and Chinese, inter-word
space is not typically rendered at all.

Note that a sequence of white spaces between words in the source document
may result in an entirely different rendered inter-word spacing (except in the
case of the
PRE element). In particular, user agents should
collapse input white space sequences when producing output
inter-word space. This can and should be done even in the absence of language
information (from the lang attribute, the HTTP “Content-Language” header field (see [RFC2616], section
14.12), user agent settings, etc.).
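The collapsing rule can be illustrated with a minimal Python sketch. The function name collapse_white_space and its inter_word parameter are hypothetical, and rendering inter-word space as a single ASCII space is a Latin-script assumption; the sketch must not be applied to the content of a PRE element.

```python
import re

# The white space characters identified in this section: space, tab,
# form feed, zero-width space, and the line break characters CR and LF.
_WHITE_SPACE_RUN = re.compile(r"[ \t\f\u200b\r\n]+")


def collapse_white_space(text: str, inter_word: str = " ") -> str:
    """Replace each run of HTML white space with one inter-word space."""
    return _WHITE_SPACE_RUN.sub(inter_word, text)
```

A user agent formatting a script that renders no inter-word space could pass an empty string as inter_word instead.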

The
PRE element is used for preformatted
text, where white space is significant.

In order to avoid problems with SGML line break rules and
inconsistencies among extant implementations, authors should not rely on user
agents to render white space immediately after a start tag or immediately
before an end tag. Thus, authors, and in particular authoring tools,
should write:

  <P>We offer free <A>technical support</A> for subscribers.</P>

and not:

  <P>We offer free<A> technical support </A>for subscribers.</P>

9.2 Structured text

9.2.1 Phrase elements: EM,
STRONG, DFN, CODE, SAMP,
KBD, VAR, CITE, ABBR, and
ACRONYM

Start tag: required, End tag:
required

Phrase elements add structural information to text fragments. The usual
meanings of phrase elements are as follows:

EM:
Indicates emphasis.
STRONG:
Indicates stronger emphasis.
CITE:
Contains a citation or a reference to other sources.
DFN:
Indicates that this is the defining instance of the enclosed term.
CODE:
Designates a fragment of computer code.
SAMP:
Designates sample output from programs, scripts, etc.
KBD:
Indicates text to be entered by the user.
VAR:
Indicates an instance of a variable or program argument.
ABBR:
Indicates an abbreviated form (e.g., WWW, HTTP, URI, Mass., etc.).
ACRONYM:
Indicates an acronym (e.g., WAC, radar, etc.).

EM and STRONG are used to indicate emphasis. The other phrase elements have
particular significance in technical documents. These examples illustrate some
of the phrase elements:

As <CITE>Harry S. Truman</CITE> said,
<Q lang="en-us">The buck stops here.</Q>

More information can be found in <CITE>[ISO-0000]</CITE>.

Please refer to the following reference number in future
correspondence: <STRONG>1-234-55</STRONG>

The presentation of phrase elements depends on the user agent. Generally,
visual user agents present EM text in italics and STRONG text in bold font. Speech
synthesizer user agents may change the synthesis parameters, such as volume,
pitch and rate accordingly.

The
ABBR and ACRONYM elements allow authors to clearly indicate
occurrences of abbreviations and acronyms.
Western languages make extensive use of acronyms such as “GmbH”, “NATO”, and
“F.B.I.”, as well as abbreviations like “M.”, “Inc.”, “et al.”, “etc.”. Both
Chinese and Japanese use analogous abbreviation mechanisms, wherein a long name
is referred to subsequently with a subset of the Han characters from the
original occurrence. Marking up these constructs provides useful information to
user agents and tools such as spell checkers, speech synthesizers, translation
systems and search-engine indexers.

The content of the ABBR and
ACRONYM elements specifies the abbreviated
expression itself, as it would normally appear in running text. The title
attribute of these elements may be used to provide the full or expanded form of
the expression.

Here are some sample uses of
ABBR:

  <P>
  <ABBR title="World Wide Web">WWW</ABBR>
  <ABBR lang="fr"
        title="Soci&eacute;t&eacute; Nationale des Chemins de Fer">SNCF</ABBR>
  <ABBR lang="es" title="Do&ntilde;a">D&ntilde;a.</ABBR>
  <ABBR title="Abbreviation">abbr.</ABBR>

Note that abbreviations and acronyms often have idiosyncratic
pronunciations. For example, while “IRS” and “BBC” are typically pronounced
letter by letter, “NATO” and “UNESCO” are pronounced phonetically. Still other
abbreviated forms (e.g., “URI” and “SQL”) are spelled out by some people and
pronounced as words by other people. When necessary, authors should use style
sheets to specify the pronunciation of an abbreviated form.

9.2.2 Quotations: The BLOCKQUOTE and
Q elements

<!ELEMENT BLOCKQUOTE - - (%block;|SCRIPT)+ -- long quotation -->
<!ATTLIST BLOCKQUOTE
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URI;          #IMPLIED  -- URI for source document or msg --
  >
<!ELEMENT Q - - (%inline;)*            -- short inline quotation -->
<!ATTLIST Q
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URI;          #IMPLIED  -- URI for source document or msg --
  >

Start tag: required, End tag:
required

Attribute definitions

cite = uri [CT]
The value of this attribute is a URI that designates a source document or
message. This attribute is intended to give information about the source from
which the quotation was borrowed.

These two elements designate quoted text.
BLOCKQUOTE is for long quotations (block-level content) and Q is intended
for short quotations (inline content) that don’t require paragraph breaks.

This example formats an excerpt from “The Two Towers”, by J.R.R. Tolkien, as
a blockquote.

<BLOCKQUOTE cite="http://www.mycom.com/tolkien/twotowers.html">
<P>They went in single file, running like hounds on a strong scent,
and an eager light was in their eyes. Nearly due west the broad
swath of the marching Orcs tramped its ugly slot; the sweet grass
of Rohan had been bruised and blackened as they passed.</P>
</BLOCKQUOTE>

Visual user agents generally render BLOCKQUOTE as an indented
block.

Visual user agents must ensure that the content of the Q element is
rendered with delimiting quotation marks. Authors should not put quotation
marks at the beginning and end of the content of a Q element.

User agents should render quotation marks in a language-sensitive manner
(see the lang attribute). Many languages adopt different quotation styles for
outer and inner (nested) quotations, which should be respected by
user agents.

The following example illustrates nested quotations with the Q element.

John said, <Q lang="en-us">I saw Lucy at lunch, she told me
<Q lang="en-us">Mary wants you
to get some ice cream on your way home.</Q> I think I will get
some at Ben and Jerry's, on Gloucester Road.</Q>

Since the language of both quotations is American English, user agents
should render them appropriately, for example with single quote marks around
the inner quotation and double quote marks around the outer quotation:

  John said, "I saw Lucy at lunch, she told me 'Mary wants you
  to get some ice cream on your way home.' I think I will get some
  at Ben and Jerry's, on Gloucester Road."
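As a rough illustration of depth-sensitive quotation marks, the following Python sketch replaces Q start and end tags with marks chosen by nesting depth. The class QuoteRenderer, the function render_quotes, and the EN_US_MARKS table are hypothetical names; the table covers only the American English convention described above, and a real user agent would select marks from the lang attribute.

```python
from html.parser import HTMLParser

# Assumed mark table for American English: double marks for outer
# quotations, single marks one level deeper, alternating after that.
EN_US_MARKS = [('"', '"'), ("'", "'")]


class QuoteRenderer(HTMLParser):
    """Collect text, replacing Q start and end tags with quotation marks."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of Q elements currently open
        self.parts = []  # output fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag == "q":   # HTMLParser reports tag names in lower case
            self.parts.append(EN_US_MARKS[self.depth % 2][0])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "q" and self.depth > 0:
            self.depth -= 1
            self.parts.append(EN_US_MARKS[self.depth % 2][1])

    def handle_data(self, data):
        self.parts.append(data)


def render_quotes(markup: str) -> str:
    renderer = QuoteRenderer()
    renderer.feed(markup)
    renderer.close()  # flush any text buffered after the last tag
    return "".join(renderer.parts)
```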

Note. We recommend that style sheet implementations
provide a mechanism for inserting quotation marks before and after a quotation
delimited by BLOCKQUOTE in a manner appropriate to the current language
context and the degree of nesting of quotations.

However, as some authors have used BLOCKQUOTE merely as a mechanism
to indent text, in order to preserve the intention of the authors, user agents
should not insert quotation marks in the default
style.

The usage of BLOCKQUOTE to indent text is deprecated in favor of style sheets.

9.2.3 Subscripts and superscripts: the SUB and SUP elements

Start tag: required, End tag:
required

Many languages (e.g., French) require superscripts or subscripts for proper
rendering. The SUB and SUP elements should be used to mark up text in these
cases.

      H<sub>2</sub>O
      E = mc<sup>2</sup>
      <SPAN lang="fr">M<sup>lle</sup> Dupont</SPAN>

9.3 Lines and Paragraphs

Authors traditionally divide their thoughts and arguments into sequences of
paragraphs. The organization of information into paragraphs is not affected by
how the paragraphs are presented: paragraphs that are double-justified contain
the same thoughts as those that are left-justified.

The HTML markup for defining a paragraph is straightforward: the P element
defines a paragraph.

The visual presentation of paragraphs is not so simple. A number of issues,
both stylistic and technical, must be addressed:

  • Treatment of white space
  • Line breaking and word wrapping
  • Justification
  • Hyphenation
  • Written language conventions and text directionality
  • Formatting of paragraphs with respect to surrounding content

We address these questions below. Paragraph alignment and floating
objects are discussed later in this document.

9.3.1 Paragraphs: the P element

Start tag: required, End tag:
optional

The
P element represents a paragraph. It cannot contain block-level elements (including P itself).

We discourage authors from using empty P elements. User agents should ignore
empty
P elements.

9.3.2 Controlling line breaks

A line break is defined to be a carriage return (&#x000D;),
a line feed (&#x000A;), or a carriage return/line feed pair. All line
breaks constitute white space.
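The definition of a line break can be sketched in Python as follows. The function name split_on_line_breaks is hypothetical. The pattern tries the carriage return/line feed pair first so that the pair counts as one line break rather than two.

```python
import re

# CR LF pair, lone CR, or lone LF. Listing the pair first keeps it from
# being matched as two separate line breaks.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_on_line_breaks(text: str) -> list:
    """Split text at each HTML line break."""
    return _LINE_BREAK.split(text)
```

Python's built-in str.splitlines() is not a drop-in substitute here, since it also splits at form feed and at the [ISO10646] line and paragraph separator characters, which are not line breaks in HTML.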

For more information about SGML’s specification of line breaks, please
consult the notes on line
breaks in the appendix.

<!ELEMENT BR - O EMPTY                 -- forced line break -->
<!ATTLIST BR
  %coreattrs;                          -- id, class, style, title --
  >

Start tag: required, End tag:
forbidden

The
BR element forcibly breaks (ends) the current line of text.

For visual user agents, the clear attribute can be used to
determine whether markup following the BR element flows around images and
other objects floated to the left or right margin, or whether it starts after
the bottom of such objects. Further details are given in the section on alignment and floating objects.
Authors are advised to use style sheets to control text flow around floating
images and other objects.

With respect to bidirectional
formatting, the BR element should behave the same way the
[ISO10646] LINE SEPARATOR character behaves in the bidirectional
algorithm.

Sometimes authors may want to prevent a line break from occurring between
two words. The &nbsp; entity (&#160; or &#xA0;) acts as a space
where user agents should not cause a line break.

9.3.3 Hyphenation

In HTML, there are two types of hyphens: the plain hyphen and the soft
hyphen. The plain hyphen should be interpreted by a user agent as just another
character. The soft hyphen tells the user agent where a line
break can occur.

Those browsers that interpret soft hyphens must observe the following
semantics: If a line is broken at a soft hyphen, a hyphen character must be
displayed at the end of the first line. If a line is not broken at a soft
hyphen, the user agent must not display a hyphen character. For operations such
as searching and sorting, the soft hyphen should always be ignored.
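The soft hyphen semantics can be sketched in Python under simplifying assumptions: one column per character and a single break per word. The names search_key and break_at_soft_hyphen are hypothetical.

```python
from typing import Tuple

SOFT_HYPHEN = "\u00ad"


def search_key(text: str) -> str:
    """Drop soft hyphens so that searching and sorting ignore them."""
    return text.replace(SOFT_HYPHEN, "")


def break_at_soft_hyphen(word: str, room: int) -> Tuple[str, str]:
    """Break word at the last soft hyphen that fits in `room` columns.

    Returns (head, tail). The head ends with a visible hyphen; soft
    hyphens where no break occurs are not displayed. The head is empty
    when no break opportunity fits.
    """
    pieces = word.split(SOFT_HYPHEN)
    for i in range(len(pieces) - 1, 0, -1):
        head = "".join(pieces[:i]) + "-"
        if len(head) <= room:
            return head, "".join(pieces[i:])
    return "", "".join(pieces)
```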

In HTML, the plain hyphen is represented by the “-” character (&#45; or
&#x2D;). The soft hyphen is represented by the character entity reference
&shy; (&#173; or &#xAD;).

9.3.4 Preformatted text: The PRE element

Start tag: required, End tag:
required

Attribute definitions

width = number [CN]
Deprecated. This
attribute provides a hint to visual user agents about the desired width of the
formatted block. The user agent can use this information to select an
appropriate font size or to indent the content appropriately. The desired width
is expressed in number of characters. This attribute is not widely supported
currently.

The
PRE element tells visual user agents that the enclosed text is “preformatted”. When handling preformatted text,
visual user agents:

  • May leave white space intact.
  • May render text with a fixed-pitch font.
  • May disable automatic word wrap.
  • Must not disable bidirectional processing.

Non-visual user agents are not required to respect extra white space in the content of a PRE element.

For more information about SGML’s specification of line breaks, please
consult the notes on line
breaks in the appendix.

The DTD declaration for PRE indicates which elements may not appear within a
PRE element. This is the same as in HTML 3.2, and is intended to preserve
constant line spacing and column alignment for text rendered in a fixed pitch
font. Authors are discouraged from altering this behavior through style
sheets.

The following example shows a preformatted verse from Shelley’s poem To a
Skylark:

<PRE>
       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.
</PRE>

Here is how this is typically rendered:

       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.

The horizontal tab character

The horizontal tab character (decimal 9 in [ISO10646] and
[ISO88591] ) is usually interpreted by visual user agents as the smallest
non-zero number of spaces necessary to line characters up along tab stops that
are every 8 characters. We strongly discourage using horizontal tabs in
preformatted text since it is common practice, when editing, to set the
tab-spacing to other values, leading to misaligned documents.
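The usual tab interpretation, and the misalignment that results from a different tab spacing, can be sketched in Python. The function name expand_tabs is hypothetical, and the sketch assumes one column per character.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Always between 1 and tab_width spaces, never zero.
            spaces = tab_width - (column % tab_width)
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

With tab stops every 8 characters, the second field of "ab\tc" starts in column 8; an editor set to a tab spacing of 4 would place it in column 4, which is the misalignment described above.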

9.3.5 Visual rendering of paragraphs

Note. The following section is an informative
description of the behavior of some current visual user agents when formatting
paragraphs. Style sheets allow better control of paragraph formatting.

How paragraphs are rendered visually depends on the user agent. Paragraphs
are usually rendered flush left with a ragged right margin. Other defaults are
appropriate for right-to-left scripts.

HTML user agents have traditionally rendered paragraphs with white space
before and after, e.g.,

  At the same time, there began to take form a system of numbering,
  the calendar, hieroglyphic writing, and a technically advanced
  art, all of which later influenced other peoples.

  Within the framework of this gradual evolution or cultural
  progress the Preclassic horizon has been divided into Lower,
  Middle and Upper periods, to which can be added a transitional
  or Protoclassic period with several features that would later
  distinguish the emerging civilizations of Mesoamerica.

This contrasts with the style used in novels which indents the first line of
the paragraph and uses the regular line spacing between the final line of the
current paragraph and the first line of the next, e.g.,

     At the same time, there began to take form a system of
  numbering, the calendar, hieroglyphic writing, and a technically
  advanced art, all of which later influenced other peoples.
     Within the framework of this gradual evolution or cultural
  progress the Preclassic horizon has been divided into Lower,
  Middle and Upper periods, to which can be added a transitional
  or Protoclassic period with several features that would later
  distinguish the emerging civilizations of Mesoamerica.

Following the precedent set by the NCSA Mosaic browser in 1993, user agents
generally don’t justify both margins, in part because it’s hard to do this
effectively without sophisticated hyphenation routines. The advent of style
sheets and of anti-aliased fonts with subpixel positioning promises to offer
HTML authors richer choices than were previously possible.

Style sheets provide rich control over the size and style of a font, the
margins, space before and after a paragraph, the first line indent,
justification and many other details. The user agent’s default style sheet
renders
P elements in a familiar form, as illustrated above. One could, in
principle, override this to render paragraphs without the breaks that
conventionally distinguish successive paragraphs. In general, since this may
confuse readers, we discourage this practice.

By convention, visual HTML user agents wrap text
lines to fit within the available margins. Wrapping algorithms
depend on the script being formatted.

In Western scripts, for example, text should only be wrapped at white space.
Early user agents incorrectly wrapped lines just after the start tag or just
before the end tag of an element, which resulted in dangling punctuation. For
example, consider this sentence:

   A statue of the <A href="cih78">Cihuateteus</A>, who are patron ...

Wrapping the line just before the end tag of the A element causes the comma to be
stranded at the beginning of the next line:

  A statue of the Cihuateteus
  , who are patron ...

This is an error since there was no white space at that point in the
markup.
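A minimal greedy wrapping sketch in Python shows the intended behavior. It assumes Latin-script text from which the tags have already been removed, and the function name wrap_at_white_space is hypothetical. Because lines are broken only where the source contains white space, the comma stays attached to the preceding word, and a no-break space never becomes a break point.

```python
import re

# Break opportunities: only the HTML white space characters. The
# no-break space (U+00A0) is deliberately absent from this set.
_WHITE_SPACE_RUN = re.compile(r"[ \t\f\u200b\r\n]+")


def wrap_at_white_space(text: str, width: int) -> list:
    """Greedily fill lines of at most `width` columns, breaking at white space."""
    lines = []
    current = ""
    for word in _WHITE_SPACE_RUN.split(text):
        if not word:
            continue  # leading or trailing white space yields an empty piece
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
```

A word longer than the width is placed on a line of its own rather than being split.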

9.4 Marking document changes: The
INS and DEL elements

<!-- INS/DEL are handled by inclusion on BODY -->
<!ELEMENT (INS|DEL) - - (%flow;)*      -- inserted text, deleted text -->
<!ATTLIST (INS|DEL)
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URI;          #IMPLIED  -- info on reason for change --
  datetime    %Datetime;     #IMPLIED  -- date and time of change --
  >

Start tag: required, End tag:
required

Attribute definitions

cite = uri [CT]
The value of this attribute is a URI that designates a source document or
message. This attribute is intended to point to information explaining why a
document was changed.
datetime = datetime [CS]
The value of this attribute specifies the date and time when the change was
made.

INS and DEL are used to mark up sections of the document that have
been inserted or deleted with respect to a different
version of a document (e.g., in draft legislation where lawmakers need to view
the changes).

These two elements are unusual for HTML in that they may serve as either
block-level or inline elements (but not both). They may contain one or more
words within a paragraph or contain one or more block-level elements such as
paragraphs, lists and tables.

This example could come from a bill that raises the number of deputies a
County Sheriff can employ from 3 to 5.

<P>
  A Sheriff can employ <DEL>3</DEL><INS>5</INS> deputies.
</P>

The
INS and DEL elements must not contain block-level content when these
elements behave as inline elements.

ILLEGAL EXAMPLE:

The following is not legal HTML.

<P>
<INS><DIV>...block-level content...</DIV></INS>
</P>

User agents should render inserted and deleted text in ways that make the
change obvious. For instance, inserted text may appear in a special font,
deleted text may not be shown at all or be shown as struck-through or with
special markings, etc.

Both of the following examples correspond to November 5, 1994, 8:15:30 am,
US Eastern Standard Time.

     1994-11-05T13:15:30Z
     1994-11-05T08:15:30-05:00

Used with
INS, this gives:

<INS datetime="1994-11-05T08:15:30-05:00"
        cite="http://www.foo.org/mydoc/comments.html">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>

The document “http://www.foo.org/mydoc/comments.html” would contain comments
about why information was inserted into the document.
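That the two datetime forms given earlier denote the same instant can be checked with a short Python sketch. The helper parse_datetime is a hypothetical name; it rewrites a trailing "Z" as "+00:00" because datetime.fromisoformat() in older Python versions does not accept the "Z" designator.

```python
from datetime import datetime


def parse_datetime(value: str) -> datetime:
    """Parse a datetime attribute value that carries a UTC offset or "Z"."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


utc_form = parse_datetime("1994-11-05T13:15:30Z")
est_form = parse_datetime("1994-11-05T08:15:30-05:00")

# Timezone-aware datetimes compare by instant, not by wall-clock fields.
same_instant = utc_form == est_form
```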

Authors may also make comments about inserted or deleted text by means of
the
title attribute for the
INS and DEL elements. User agents may present
this information to the user (e.g., as a popup note). For example:

<INS datetime="1994-11-05T08:15:30-05:00"
        title="Changed as a result of Steve B's comments in meeting.">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>