Mulberry Technologies, Inc.

Washington Technologies White Papers

What is SGML?

B. Tommie Usdin, President, Mulberry Technologies, Inc.
Deborah A. Lapeyre, Vice-President, Mulberry Technologies, Inc.

SGML, pronounced "Ess-Gee-Em-Ell", is the acronym for Standard Generalized Markup Language. SGML is an international standard for encoding the structure and/or content of machine-readable information. (SGML was defined by the International Organization for Standardization in 1986 in the standard "ISO 8879".) SGML "documents" usually consist of text, graphics, and hypertext links. SGML typically describes technical documents, although SGML has been used to describe things as diverse as music, mathematics, and airplane and automobile components.

In SGML, tags (human and machine-readable markers) are inserted into data to identify relevant parts of the data. What difference does identifying parts make? Once identified, a piece of information can be: manipulated, verified, printed on paper, displayed on a screen, searched for words or phrases, or reused for something else.

SGML does for textual information much of what a database does for fixed-size data fields: it identifies them, names them, and describes their relationships so they can be managed and manipulated. SGML-based applications have been used for functions as diverse as typesetting, indexing for CD-ROM distribution, serving hypertext over the Web, and translation into foreign languages.

Platform and Vendor Independence

Why such a fuss about a data format? It didn't use to matter what format data was created and stored in. One person used a personal computer with a word-processor, another used a typewriter (or the typing pool), and management reports were generated on a mainframe. Data was exchanged on paper: computer printouts, typewriter or word-processor reports, and faxes. Information was retyped or photocopied to be reused. But that was a long time ago. Now users receive data on their desktops from all over the world. They need to manage that data; to analyze it; to slice it and dice it and distribute it in various forms to many other people -- to reuse that data in many ways. Nobody can afford the time or error rate of re-keying, and filters to convert data from one format to another are unreliable at best. They may mangle format, truncate data, or eat special characters.

So suddenly data format matters. We could solve this problem by all using the same platform -- hardware and software. That won't happen. We could reduce the problem by all using the same software. That is equally unlikely. So many people are solving the problem by using platform-independent data formats: SGML and XML.

A major advantage of SGML is platform independence. That means not only that SGML created on today's Unix workstations can be used on mainframes, Macintoshes, and PCs, but also that data tagged in SGML today won't need to be converted to work with tomorrow's machines, whatever they are. Machine-independence has to be a top priority for anyone who needs their data to live over time.

Separates Specifying Presentation From Creating Content

SGML authoring separates writing content from specifying format. Work-habits studies have found that authors spend at least 30% of their time tweaking format, often in violation of organizational style guidelines. When writing an SGML document, the author identifies the document part (title, procedure list, footnote) and provides its content. The author doesn't say what that part will look like; that is defined separately, and typically by someone else. The writer doesn't waste time on what the text looks like. (This does NOT mean that the writer must look at tags and codes. While in most applications tags are visible on demand, writers usually work in SGML-aware editors in WYSIWYG fashion.)

A major strength of SGML is the ability to produce as many different styles, looks, and formats as needed from the same SGML-tagged text without manual intervention and without the errors that manual intervention may introduce. A variety of presentation formats can be produced from the same information, such as: a public version with unclassified information and a restricted version with additional material; a printed book with CD-ROM and Web versions; or a repair manual in voice-synthesized, desk reference, and pocket versions.
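The public and restricted versions mentioned above can be illustrated with a minimal sketch. It is not taken from any real SGML application: the tag names, the "security" attribute, and the sample text are all hypothetical, and it uses well-formed XML syntax with Python's standard xml.etree.ElementTree module only because the Python standard library does not include an SGML parser.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical tagged source; the "security" attribute is an assumed convention.
SOURCE = """\
<manual>
  <title>Radar Maintenance</title>
  <para security="public">Inspect the antenna mount monthly.</para>
  <para security="restricted">Calibration codes are stored in locker 4.</para>
  <para security="public">Log every inspection.</para>
</manual>
"""

def make_version(root, allowed_levels):
    """Return a copy of the tree that keeps only paragraphs at allowed levels."""
    version = copy.deepcopy(root)
    for para in list(version.findall("para")):
        if para.get("security", "public") not in allowed_levels:
            version.remove(para)
    return version

def render_text(root):
    """Render one plain-text look: upper-case title, one paragraph per line."""
    lines = [root.findtext("title").upper()]
    lines.extend(para.text for para in root.findall("para"))
    return "\n".join(lines)

source = ET.fromstring(SOURCE)
public_text = render_text(make_version(source, {"public"}))
restricted_text = render_text(make_version(source, {"public", "restricted"}))
```

Both versions come from the same tagged text; only the selection rule passed to make_version differs. A different render function could give the same content another look without touching the source.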

Information Reuse

The same SGML text can be used in many different publications, without rekeying or re-coding. Data lives throughout the life-cycle of a product or service. Functional specifications, written in SGML, can be used by designers, who modify the information and pass it to testers, who pass their results to the production shop and technical writers, who create documents and databases for the showroom sales force and customers. SGML makes it easy to repurpose information: producing both a textbook and a teacher's guide, an encyclopedia and several subject-specific handbooks, or repair manuals and sales literature customized for the product or for a specific user.

How Does SGML Work?

Documents are divided into useful, named pieces called "elements". Elements can be structures such as titles, sections, paragraphs, and figures, or pieces of important content such as drug names, side-effects, recommendations, summaries, or part numbers. Codes called tags (which are both human- and machine-readable) are embedded in the data to identify the beginning and the end of each element. Tags can also contain additional information about the element (such as security level, revision date, or source). Each tag is named and defined in an information model called a DTD (Document Type Definition). DTDs provide the rules for structuring SGML data the way that database schemas provide rules for structuring databases. SGML parsers are software that can verify that information is structured and tagged according to the rules in a DTD. Manipulation and display are based on the tags: the information inside <example> may be indented; <title> may appear big, bold, and centered; the contents of <index> may be alphabetized and linked into the text to create an index; and the contents of other tags may be replicated as a table of contents or in running heads in a print document.
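As a rough illustration of the ideas in this section, the sketch below checks a small tagged document against a toy content model and then builds a table of contents and an alphabetized index from the tags. Everything in it is an assumption made for illustration: the element names, the ALLOWED_CHILDREN table (which stands in for a DTD but checks only which elements may appear inside which, not order or repetition), and the sample text. It uses well-formed XML syntax and Python's standard xml.etree.ElementTree module because the Python standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model standing in for a DTD: each element name maps
# to the set of child elements it may contain.
ALLOWED_CHILDREN = {
    "manual": {"title", "section"},
    "section": {"title", "para", "example"},
    "title": set(),
    "para": {"index"},
    "example": set(),
    "index": set(),
}

DOCUMENT = """\
<manual>
  <title>Pump Repair</title>
  <section revision="1997-03-01">
    <title>Removing the Housing</title>
    <para>Loosen the <index>retaining bolt</index> first.</para>
    <example>Turn counter-clockwise.</example>
  </section>
  <section revision="1997-04-15">
    <title>Cleaning the Filter</title>
    <para>Rinse the <index>filter mesh</index> and the <index>gasket</index>.</para>
  </section>
</manual>
"""

def check_structure(element):
    """Return a list of containment violations found at or below this element."""
    problems = []
    allowed = ALLOWED_CHILDREN.get(element.tag)
    if allowed is None:
        problems.append(f"undeclared element <{element.tag}>")
    for child in element:
        if allowed is not None and child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        problems.extend(check_structure(child))
    return problems

root = ET.fromstring(DOCUMENT)
problems = check_structure(root)

# Tag-driven processing: section titles become a table of contents,
# and the contents of <index> elements become an alphabetized index.
toc = [section.findtext("title") for section in root.iter("section")]
index_terms = sorted(term.text for term in root.iter("index"))
```

The same tags serve both purposes: check_structure uses them to verify the document, and the last two lines use them to derive new views of the content.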

Applications of SGML

SGML is a system for creating tags; SGML is not a particular set of tags. Many shared sets of tags have been created using SGML, and many users create tags to meet their individual needs.

Since SGML is often used for interchanging information, it is convenient for the people and organizations who need to interchange data to agree on their textual markup. In creating an SGML application, groups can create a format for data that they can all produce and use, even if they use different hardware and software systems. Applications of SGML include:

  • HTML -- The most widely used application of SGML is HTML, the data format behind most of the pages on the WWW and intranets around the world. In HTML, tags from a set provided by the W3C are embedded in data to drive hypertext links and browser formatting. SGML and XML let users create their own organization- or industry-specific tags and use them, not just to make hypertext links work and display on browsers, but for document security, version control, context-specific searching, and much, much more. The tags are under user control and can be designed to meet user needs.
  • DoD CALS -- The United States DoD has developed a suite of SGML applications, called the CALS standard. (Many parts of this standard, including IETMs (Interactive Electronic Technical Manuals) have been adopted by the defense agencies in Canada, Japan, Australia, and many European countries.)
  • Industry Standards -- Among the industries that have developed shared SGML applications and DTDs are: airline and aircraft, telecommunications, automotive and trucking, semiconductor, newspaper, and publishing. (A medical standard is under development.)
  • Government -- Government agencies and departments among the first to increase the value of their information with SGML include: the Intelligence Community, SEC, DOE, the House and the Senate, NIST, NLM, IRS, GPO, PTO, FDA, and CRS. Foreign SGML users include NATO forces, the World Court, and the Swedish FDA.
  • Academic -- Academic projects range from coding all known works in ancient Greek, to analyzing the influences of popular culture, to tracking wording changes in the Christian Bible.

Major commercial SGML users include: Caterpillar, Novell, Microsoft, John Deere, Sun, McGraw-Hill, Silicon Graphics, RR Donnelley, The Boeing Company, Walt Disney Imagineering, Nortel, Ricoh, Siemens, and Medical Economics, each of which has created tags and full SGML applications to meet its needs.

SGML is highly compatible with both existing and especially SGML-created databases: the objects stored in a database can also be SGML elements; SGML can function as an output and interchange format from a database; and SGML-tagged data can be used to load a database.
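The two directions described above, loading a database from tagged data and producing tagged data from a database, can be sketched as follows. This is only an illustration under stated assumptions: the catalog markup, the "parts" table, and the function names are hypothetical, and the example uses well-formed XML syntax with Python's standard sqlite3 and xml.etree.ElementTree modules in place of an SGML toolchain.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tagged parts catalog.
CATALOG = """\
<catalog>
  <part number="A-100"><name>Gasket</name><price>2.50</price></part>
  <part number="B-220"><name>Filter mesh</name><price>7.25</price></part>
</catalog>
"""

def load_parts(connection, markup):
    """Load each <part> element into a row of a (hypothetical) parts table."""
    connection.execute(
        "CREATE TABLE parts (number TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    rows = [
        (part.get("number"), part.findtext("name"), float(part.findtext("price")))
        for part in ET.fromstring(markup).iter("part")
    ]
    connection.executemany("INSERT INTO parts VALUES (?, ?, ?)", rows)

def export_parts(connection):
    """Write the parts table back out as tagged data for interchange."""
    catalog = ET.Element("catalog")
    query = "SELECT number, name, price FROM parts ORDER BY number"
    for number, name, price in connection.execute(query):
        part = ET.SubElement(catalog, "part", number=number)
        ET.SubElement(part, "name").text = name
        ET.SubElement(part, "price").text = f"{price:.2f}"
    return ET.tostring(catalog, encoding="unicode")

connection = sqlite3.connect(":memory:")
load_parts(connection, CATALOG)
exported = export_parts(connection)
```

Because each row corresponds to one element, an update made in the database can be carried into every document regenerated from the exported markup.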

The Bottom Line

Is SGML for everyone and everything? Of course not! There is always a faster, cheaper, easier way to produce any one information product than using SGML. Most organizations get into SGML when "one thing" is no longer enough. No technology short of SGML will support producing searchable pages on the Web, and a full-text search system on CD-ROM, and both full-sized and pocket-sized print editions, and Braille and voice-synthesis deliveries, and creation of client-specific documents on the fly, and instant updates to a parts catalog database which are reflected instantly in print and electronic sales literature. SGML is up to the challenge!


A Glossary of XML/SGML Terms and Acronyms

What Does "SGML" Stand For?

Standard
SGML is an International Organization for Standardization standard, is one of the CALS standards, and is recognized by ANSI and NISO.
Generalized
Not specific to any machine, operating system, software package, or type of information; non-proprietary (not owned, maintained, or controlled by a vendor); not one set of tags.
Markup
Markup is information added to data to make the data more useful. SGML tags are markup placed inside information to identify its parts.
Language
Not a natural language (like Spanish or English); not a computer language (like Java or C++); SGML is a meta-language, a system of rules for creating markup.

What Do Those Hard Words Mean?

Document Type Definition (DTD)
A formal model of the structure of information; defines the tags permitted in the documents and describes the relationships between elements.
Document Instance
The textual data, including the tags that identify various parts of the information. (The tags are between the "<" and ">" symbols!)
Output Specification
A description (such as a style sheet) of one of the ways the document should look to or behave for the end user.
Rendered Document
An SGML document in one of the forms in which it will be used.
In-process Document
Most tools for creating and editing SGML documents have user-friendly front ends that make the creation and editing of the information much easier than actually looking at <tags>.

What Does That Acronym Stand For?

SGML
(Standard Generalized Markup Language) The international standard for encoding information based on the structure and content of the information. SGML data is platform-independent, vendor-independent, and media-independent so it can be used and reused in a wide variety of applications.
HTML
(HyperText Markup Language) The most common language for creating pages and forms to display on the Internet. HTML, as many organizations use it, is an application of SGML.
XML
(Extensible Markup Language) An extremely simple dialect of SGML designed to be served, received, and processed on the Internet's WWW.
W3C
(World Wide Web Consortium) Major consortium of vendors and organizations that sets the rules for the WWW. The people who brought you HTML.
ISO
(International Organization for Standardization) International body that writes and sponsors many multinational standards like SGML (ISO 8879). The people who brought you ISO 9000.


© Mulberry Technologies, Inc. 1997