Washington Technologies White Papers
What is SGML?
B. Tommie Usdin, President, Mulberry Technologies, Inc.
Deborah A. Lapeyre, Vice-President, Mulberry Technologies, Inc.
SGML, pronounced "Ess-Gee-Em-Ell", is the
acronym for Standard Generalized Markup Language. SGML
is an international standard for encoding the structure and/or content of machine-readable
information. (SGML was defined by the International
Organization for Standardization in 1986 in the standard "ISO
8879".) SGML "documents" usually
consist of text, graphics, and hypertext links. SGML
typically describes technical documents, although SGML
has been used to describe things as diverse as music, mathematics, and
airplane and automobile components.
In SGML, tags (human and machine-readable markers)
are inserted into data to identify relevant parts of the data. What
difference does identifying parts make? Once identified, a piece of
information can be: manipulated, verified, printed on paper, displayed on
a screen, searched for words or phrases, or reused for something else.
SGML does for textual information much of what a
database does for fixed-size data fields: it identifies them, names them,
and describes their relationships so they can be managed and manipulated.
SGML-based applications have been used for functions
as diverse as typesetting, indexing for CD-ROM
distribution, serving hypertext over the Web, and translation into foreign
languages.
Why such a fuss about a data format? It didn't use to matter what
format data was created and stored in. One person used a personal computer
with a word-processor, another used a typewriter (or the typing pool), and
management reports were generated on a mainframe. Data was exchanged on
paper: computer printouts, typewriter or word-processor reports, and
faxes. Information was retyped or photocopied to be reused. But that was a
long time ago. Now users receive data on their desktops from all over the
world. They need to manage that data; to analyze it; to slice it and dice
it and distribute it in various forms to many other people -- to reuse
that data in many ways. Nobody can afford the time or error-rate for
re-keying, and filters to convert data from one format to another are
unreliable at best. They may mangle format, truncate data, or eat special
characters.
So suddenly data format matters. We could solve this problem by all
using the same platform -- hardware and software. That won't happen. We
could reduce the problem by all using the same software. And this is
equally unlikely. So, many people are solving the problem by using
platform-independent data formats: SGML and XML.
A major advantage of SGML is platform independence.
That not only means that SGML created on today's
Unix workstations can be used on mainframes,
Macintoshes, and PCs; but also that data tagged in
SGML today won't need to be converted to work with
tomorrow's machines, whatever they are. Machine-independence has to be a
top priority for anyone who needs their data to live over time.
SGML authoring separates writing content from
specifying format. Work-habits studies have found that authors spend at
least 30% of their time tweaking format, often in violation of
organizational style guidelines. When writing an SGML
document, the author identifies the document part (title, procedure list,
footnote) and provides its content. They don't say what that part will
look like; that is defined separately, and typically by someone else. The
writer doesn't waste time on what the text looks like. (This does NOT mean
that the writer must look at tags and codes. While in most applications
tags are visible on demand, writers usually work in SGML-aware
editors in WYSIWYG fashion.)
A major strength of SGML is the ability to produce
as many different styles, looks, and formats as needed from the same
SGML-tagged text without manual intervention and
without the errors that such intervention may introduce. A variety of presentation formats
can be produced from the same information, such as: a public version with
unclassified information and a restricted version with additional
material; a printed book with CD-ROM and Web versions;
or a repair manual in voice-synthesized, desk reference, and pocket
versions.
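As a rough sketch of how one tagged source might yield more than one version: the fragment below is written in XML syntax (a dialect of SGML) only so that Python's standard-library parser can read it. The element names (manual, title, step), the security attribute, and the render function are hypothetical, invented for this illustration; they are not taken from any real tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged source; element and attribute names are illustrative only.
SOURCE = """
<manual>
  <title>Pump Repair</title>
  <step>Close the inlet valve.</step>
  <step security="restricted">Enter the override code.</step>
  <step>Replace the seal.</step>
</manual>
"""

def render(source, include_restricted):
    """Return a plain-text version; the look is decided here, not in the source."""
    root = ET.fromstring(source)
    lines = [root.findtext("title").upper()]
    number = 0
    for step in root.iter("step"):
        # Skip restricted material when producing the public version.
        if step.get("security") == "restricted" and not include_restricted:
            continue
        number += 1
        lines.append(f"{number}. {step.text}")
    return "\n".join(lines)

public_version = render(SOURCE, include_restricted=False)
restricted_version = render(SOURCE, include_restricted=True)
print(public_version)
```

Both versions come from the same tagged text; only the rendering rule differs, so nothing is re-keyed between them.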
The same SGML text can be used in many different
publications, without rekeying or re-coding. Data lives throughout the
life-cycle of a product or service. Functional specifications, written in
SGML, can be used by designers, who modify the
information and pass it to testers, who pass their results to the
production shop and technical writers, who create documents and databases
for the showroom sales force and customers. SGML makes
it easy to repurpose information: producing both a textbook and a
teacher's guide, an encyclopedia and several subject-specific handbooks,
repair manuals and sales literature customized for the product or for a
specific user.
Documents are divided into useful, named pieces, called "elements".
Elements can be structures such as titles, sections, paragraphs, and
figures or pieces of important content such as drug names, side-effects,
recommendations, summaries, or part numbers. Codes called tags (that are
both human and machine-readable) are embedded in the data to identify the
beginning and the end of each element. Tags can also contain additional
information about the element (such as security level, revision date, or
source). Each tag is named and defined in an information model called a
DTD (Document Type Definition). DTDs
provide the rules for structuring SGML data the way
that database schemas provide rules for structuring databases. SGML
parsers are software that can verify that information is structured and
tagged according to the rules in a DTD. Manipulation and
display are based on the tags: the information inside <example> may
be indented; <title> may appear big, bold, and centered; the
contents of <index> may be alphabetized and linked into the text
to create an index; and the contents of other tags may be replicated as a
table of contents or in running heads in a print document.
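The ideas above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real SGML parser: the document uses XML syntax so the standard library can read it, the ALLOWED_CHILDREN table is a hypothetical stand-in for a DTD (a real DTD also expresses order, repetition, and attributes), and all element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: which child elements each element may contain.
ALLOWED_CHILDREN = {
    "doc": {"title", "section"},
    "section": {"title", "para"},
    "title": set(),
    "para": {"index"},
    "index": set(),
}

# A document instance: content plus tags marking the start and end of each element.
DOCUMENT = """
<doc>
  <title>Widget Guide</title>
  <section revision="1997-03">
    <title>Safety</title>
    <para>Wear <index>gloves</index> and <index>eye protection</index>.</para>
  </section>
  <section>
    <title>Assembly</title>
    <para>Attach the <index>bracket</index> first.</para>
  </section>
</doc>
"""

def structure_errors(element):
    """List every child element that the rules do not permit inside its parent."""
    errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed in <{element.tag}>")
        errors.extend(structure_errors(child))
    return errors

root = ET.fromstring(DOCUMENT)
errors = structure_errors(root)
# Table of contents: the title of each section, in document order.
contents = [section.findtext("title") for section in root.iter("section")]
# Index: the content of every <index> element, alphabetized.
index_terms = sorted(term.text for term in root.iter("index"))
print(errors, contents, index_terms)
```

The table of contents and the index are derived from the tags alone; the text itself is never edited to produce them.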
SGML is a system for creating tags; SGML
is not a particular set of tags. Many shared sets of tags have been
created using SGML, and many users create tags to meet
their individual needs.
Since SGML is often used for interchanging
information, it is convenient for the people and organizations who need to
interchange data to agree on their textual markup. In creating an
SGML application, groups can create a format for data
that they can all produce and use, even if they use different hardware and
software systems. Applications of SGML include:
- HTML -- The most widely used application of
SGML is HTML, the data format
behind most of the pages on the WWW and intranets
around the world. To use HTML, tags from a tag set provided by
W3C are embedded in data to drive hypertext
links and browser formatting. SGML and XML
let users create their own organization- or industry-specific tags and
use them, not just to make hypertext links work and display on browsers,
but for document security, version control, context-specific searching,
and much, much more. The tags are under user control and can be designed
to meet user needs.
- DoD CALS -- The United States
DoD has developed a suite of SGML
applications, called the CALS standard. (Many parts
of this standard, including IETMs (Interactive
Electronic Technical Manuals), have been adopted by the defense agencies
in Canada, Japan, Australia, and many European countries.)
- Industry Standards -- Among the industries that have developed shared
SGML applications and DTDs are:
airline and aircraft, telecommunications, automotive and trucking,
semiconductor, newspaper, and publishing. (A medical standard is under
development.)
- Government -- Government agencies and departments among the first to
increase the value of their information with SGML
include: the Intelligence Community, SEC, DOE,
the House and the Senate, NIST, NLM,
IRS, GPO, PTO,
FDA, and CRS. Foreign SGML
users include NATO forces, the World Court, and the
Swedish FDA.
- Academic -- Academic projects range from coding all known
works in ancient Greek, to analyzing the influences of popular culture,
to tracking wording changes in the Christian Bible.
Major commercial SGML users include: Caterpillar,
Novell, Microsoft, John Deere, Sun, McGraw-Hill, Silicon Graphics, RR
Donnelley, The Boeing Company, Walt Disney Imagineering, Nortel, Ricoh,
Siemens, and Medical Economics; each of whom has created tags and full
SGML applications to meet their needs.
SGML is highly compatible with databases, both existing ones and,
especially, those created with SGML in mind: the objects stored
in a database can also be SGML elements; SGML
can function as an output and interchange format from a database; and
SGML-tagged data can be used to load a database.
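A minimal sketch of both directions, assuming XML syntax (a dialect of SGML) so that Python's standard library can parse it, and an in-memory SQLite database. The catalog, part, partno, and name elements, the parts table, and the export function are all hypothetical names chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tagged catalog data; element names are illustrative only.
CATALOG = """
<catalog>
  <part><partno>A-100</partno><name>Seal</name></part>
  <part><partno>B-200</partno><name>Bracket</name></part>
</catalog>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (partno TEXT PRIMARY KEY, name TEXT)")

# Loading: each <part> element becomes one database row.
rows = [
    (part.findtext("partno"), part.findtext("name"))
    for part in ET.fromstring(CATALOG).iter("part")
]
conn.executemany("INSERT INTO parts VALUES (?, ?)", rows)

def export(connection):
    """Output: turn the rows back into tagged data for interchange."""
    catalog = ET.Element("catalog")
    query = "SELECT partno, name FROM parts ORDER BY partno"
    for partno, name in connection.execute(query):
        part = ET.SubElement(catalog, "part")
        ET.SubElement(part, "partno").text = partno
        ET.SubElement(part, "name").text = name
    return ET.tostring(catalog, encoding="unicode")

exported = export(conn)
print(exported)
```

Because the element names line up with the column names here, the mapping is trivial; real applications usually need a more deliberate mapping between document structure and database schema.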
Is SGML for everyone and everything? Of course not!
There is always a faster, cheaper, easier way to produce any one
information product than using SGML. Most organizations get into
SGML when "one thing" is no longer enough.
No technology short of SGML will support producing
searchable pages on the Web, and a full-text search system on
CD-ROM, and both full-sized and pocket-sized
print editions, and Braille and voice-synthesis deliveries,
and creation of client-specific documents on the fly, and
instant updates to a parts catalog database which are reflected instantly
in print and electronic sales literature. SGML is up
to the challenge!
A Glossary of XML/SGML Terms and Acronyms
What Does "SGML" Stand For?
- Standard
- SGML is an International Organization for
Standardization standard, is one of the CALS
standards, and is recognized by ANSI and NISO.
- Generalized
- Not specific to any machine, operating system, software package, or
type of information; non-proprietary (not owned, maintained, or
controlled by a vendor); not one set of tags.
- Markup
- Markup is the information added to information to make the
information more useful. SGML tags are markup inside
information that identify parts of the information.
- Language
- Not a natural language (like Spanish or English); not a computer
language (like Java or C++); SGML is a
meta-language, a system of rules for creating markup.
What Do Those Hard Words Mean?
- Document Type Definition (DTD)
- A formal model of the structure of information; defines the tags
permitted in the documents and describes the relationships between
elements.
- Document Instance
- The textual data, including the tags that identify various parts of
the information. (The tags are between the "<" and ">"
symbols!)
- Output Specification
- A description (such as a style sheet) of one of the ways the document
should look to or behave for the end user.
- Rendered Document
- An SGML document in one of the forms in which it
will be used.
- In-process Document
- Most tools for creating and editing SGML
documents have user-friendly front ends that make the creation and
editing of the information much easier than actually looking at <tags>.
What Does That Acronym Stand For?
- SGML
- (Standard Generalized Markup Language) The international standard for
encoding information based on the structure and content of the
information. SGML data is platform-independent,
vendor-independent, and media-independent so it can be used and reused
in a wide variety of applications.
- HTML
- (HyperText Markup Language) The most common language for creating
pages and forms to display on the Internet. HTML, as
many organizations use it, is an application of SGML.
- XML
- (Extensible Markup Language) An extremely simple dialect of
SGML designed to be served, received, and processed
on the Internet's WWW.
- W3C
- (World Wide Web Consortium) Major consortium of vendors and
organizations that sets the rules for the WWW. The
people who brought you HTML.
- ISO
- (International Organization for Standardization) International body
that writes and sponsors many multinational standards like SGML
(ISO 8879). The people who brought you ISO
9000.
© Mulberry Technologies, Inc. 1997