2001 Conference Proceedings
Transcoding System for the Non-Visual Web Access (1) -
Automatic Transcoding -
Hironobu Takagi
takagih@jp.ibm.com
Chieko Asakawa
chie@jp.ibm.com
IBM Japan Ltd.
Tokyo Research Laboratory
1623-14, Shimotsuruma
Yamato-shi, Kanagawa-ken 242-8502
Japan
Introduction
The Internet offers a new information resource for blind computer
users. The Web holds huge amounts of data of various kinds, and
visitors can get information from all over the world whenever
they want it. In addition, they can communicate with others
through e-mail. The Internet has thus become a drastically
different type of information resource for blind people.
These days, however, Web pages are becoming
increasingly visual, using JavaScript, layout tables, frames,
images and so on, since Web authors are paying more attention to
the visual appearance. This makes many Web pages inaccessible to
blind users.
Our research goal is to make the Web a better information
resource for the blind, so even a computer novice can easily
access the Web. Unfortunately, many of the news sites' visual
representations on the Web make non-visual Web access much more
difficult. To solve this problem, the Web pages themselves should
be simplified and made more easily accessible for blind users.
Therefore, we decided to transcode inaccessible pages to be more
accessible before they reach the client side. As a first
step in this direction, we focused on news sites and Web search
engines, which are very popular and widely used. Our
approach is divided into two components. One is to simplify a Web
page by locating the differences between two HTML documents. This
enables users to refer only to newly updated information. For
example, users can easily track down today's articles. The system
can even use the network to find a page to be compared with for
differences, even when there is no comparable page in the cache.
The other component is based on rules from our experiences with
non-visual Web access. For example, one rule is to insert
necessary information such as missing alternative texts for image
links. Other rules are to move image maps and forms to the bottom
of a page and so on. These approaches are for general pages, so
they can be used without regard to the language, and any page can
be transcoded by our system.
In this paper, we will describe an overview of our system. After
showing examples of transcoded pages, we will offer some
conclusions and plans for the future.
Automatic transcoding
Architecture
Figure 1 -- Architecture of Automatic Transcoder
Figure 1 shows the system configuration. Users can access the
Internet seamlessly by using our system as a proxy server, only
changing the proxy settings of their browsers. When the proxy
server receives an HTTP request from a user, it gets the target
HTML file. The simplification module removes the layout
components, such as index lists, banners and image maps, from the
page. The file is then sent to the Insert ALT module, which
inserts missing alternative texts for each image link or image
map by getting the titles of the destination pages.
Simplification based on calculating the differential
The simplification module simplifies each page by calculating the
differential between the target page and its neighboring pages.
At first, we implemented the system to use a previous version of
the same URL from the cache for comparison with the target file.
This allowed the system to simplify pages by keeping only the
updated information on each page. However, we found that this
method is not useful with newspaper sites or search engines.
Since their URLs change every day or with each search, the system
usually cannot find a previous file in the cache.
For newspaper sites, the URL of an article often includes the
date and the month. For example:
http://www.cnn.com/2000/TECH/space/09/22/space.station.decor.ap/index.html
In this URL, "09/22" means September 22nd, and
"space.station.decor.ap" is an abbreviated title of the article.
This naming scheme means that there is no previous file with the
same URL, so there is no obvious comparison file to calculate the
differential from.
The result pages of search engines have the same problem. Each
URL of a search result includes the keywords of the search. For
example:
http://www.go.com/Titles?col=WW&qt=California+State+University
In this URL, "California+State+University" are the retrieval
keywords, so each retrieval request has a unique URL. If a
previous page for this URL existed, the system could calculate
the differential and get the new results, but in general we
cannot expect that a previous search with the same retrieval
keywords will exist in the cache database.
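The cache-miss problem can be illustrated with a minimal sketch. It assumes a cache keyed by the exact URL; the `cache` dictionary and both function names are hypothetical and are not part of the system described here. The `qt` parameter name is taken from the example URL above.

```python
from urllib.parse import urlsplit, parse_qs


def search_keywords(url, param="qt"):
    """Return the retrieval keywords carried in the query string of `url`."""
    query = parse_qs(urlsplit(url).query)
    values = query.get(param, [])
    return values[0].split() if values else []


def cached_previous_version(url, cache):
    """Exact-URL lookup: succeeds only if the very same URL was fetched before."""
    return cache.get(url)


# Hypothetical cache holding one earlier search on the same engine.
cache = {
    "http://www.go.com/Titles?col=WW&qt=assistive+technology+blind": "<html>...</html>",
}
target = "http://www.go.com/Titles?col=WW&qt=California+State+University"

print(search_keywords(target))                 # ['California', 'State', 'University']
print(cached_previous_version(target, cache))  # None
```

Because the keywords are part of the URL, the lookup fails even though the cache already holds a page from the same search engine with the same layout.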
We needed to develop more general methods for specifying a file
to be compared with the target file. The system should have a
function to get pages for comparison, even when there is no
previous version of a page. Therefore we developed methods of
using not only previous pages, but also neighboring pages.
"Neighboring pages" refers to the pages that have the same
layout, indexes, banners and other frequently appearing elements.
We analyzed some major newspaper sites and search engines, and
developed some general rules to list candidate neighbor pages
which may have a similar layout.
1. Pages in the same directory
   Example:
   Target-> http://www.asahi.com/0423/news/national23008.html
   Neighbor-> http://www.asahi.com/0423/news/national23010.html
2. Pages having the same parent directory as the target file
   Example:
   Target-> http://www.cnn.com/2000/TECH/space/09/22/plutoprobe.ap/index.html
   Neighbor-> http://www.cnn.com/2000/TECH/space/09/22/space.station.decor.ap/index.html
3. The index file of each parent directory
   Example:
   Target-> http://www.cnn.com/2000/TECH/space/09/22/plutoprobe.ap/index.html
   Neighbors->
   http://www.cnn.com/2000/TECH/space/09/22/index.html
   http://www.cnn.com/2000/TECH/space/09/index.html
   . . .
   http://www.cnn.com/index.html
4. For retrieval results, result pages using different keywords
   from the cache database
   For search engines, the system considers the result pages for
   other retrieval keywords from the same search engine, since those
   result pages usually have almost the same layout, except for the
   actual results and some link lists or banners.
   Example:
   http://www.go.com/Titles?col=WW&qt=California+State+University
   http://www.go.com/Titles?col=WW&qt=assistive+technology+blind
5. Previous pages from the cache database
   The system uses not only neighboring pages, but also cached
   files. The module searches the files in the cache database, and
   selects any old files for the target URL. These files are added
   to the candidate list.
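Rules 1 to 3 depend only on the URL structure, so they can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical, and `known_urls` stands in for whatever set of URLs the system already knows about (for example, from the cache or from links on the page). Rules 4 and 5 need the cache database and are not covered.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def directory_of(url):
    """Return (host, directory path) for a URL."""
    parts = urlsplit(url)
    return parts.netloc, posixpath.dirname(parts.path)


def in_same_directory(target, other):
    """Rule 1: another page in the same directory as the target."""
    return other != target and directory_of(other) == directory_of(target)


def shares_parent_directory(target, other):
    """Rule 2: the two pages live in sibling directories."""
    t_host, t_dir = directory_of(target)
    o_host, o_dir = directory_of(other)
    return (t_host == o_host and t_dir != o_dir
            and posixpath.dirname(t_dir) == posixpath.dirname(o_dir))


def parent_index_urls(target, index_name="index.html"):
    """Rule 3: the index file of every parent directory, nearest first."""
    parts = urlsplit(target)
    directory = posixpath.dirname(parts.path)
    urls = []
    while directory not in ("/", ""):
        directory = posixpath.dirname(directory)
        path = posixpath.join(directory, index_name)
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


def candidate_neighbors(target, known_urls):
    """Collect candidates from rules 1 and 2, then append rule 3 URLs."""
    candidates = [u for u in known_urls
                  if in_same_directory(target, u)
                  or shares_parent_directory(target, u)]
    for url in parent_index_urls(target):
        if url != target and url not in candidates:
            candidates.append(url)
    return candidates


target = "http://www.cnn.com/2000/TECH/space/09/22/plutoprobe.ap/index.html"
known = ["http://www.cnn.com/2000/TECH/space/09/22/space.station.decor.ap/index.html"]
for url in candidate_neighbors(target, known):
    print(url)
```

For the CNN target above, the sketch lists the sibling article first, followed by the index files from `/2000/TECH/space/09/22/index.html` up to `http://www.cnn.com/index.html`.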
After collecting all the files as a list of candidates by the
above methods, the simplification module calculates the
differences between the target HTML file and each file in the
candidate list. We use the Dynamic Programming matching method
(DP matching) to calculate a differential file. However, the
module retains the tags needed to keep the result displayable,
such as <head>, <body>, <table>, <map>, <script> and so on.
After the calculations for every candidate are finished, the
system selects the smallest differential file, that is, the one
with the shortest text, as the result page.
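The differential step can be sketched as below. This is only an illustration: it uses a longest-common-subsequence table as one possible form of DP matching, works on a naive regex tokenization of the HTML, and retains only the five tags named above. The function names and the sample pages are hypothetical.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")
# Tags retained even when shared with the candidate (the five named in the text).
KEEP = re.compile(r"</?(head|body|table|map|script)\b", re.I)


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping whitespace-only text."""
    return [t for t in TOKEN.findall(html) if t.strip()]


def lcs_table(a, b):
    """table[i][j] is the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def differential(target, candidate):
    """Keep target tokens absent from the candidate, plus retained tags."""
    a, b = tokenize(target), tokenize(candidate)
    table = lcs_table(a, b)
    out, i, j = [], 0, 0
    while i < len(a):
        if j < len(b) and a[i] == b[j]:
            if KEEP.match(a[i]):      # shared, but needed for display
                out.append(a[i])
            i += 1
            j += 1
        elif j < len(b) and table[i][j + 1] > table[i + 1][j]:
            j += 1                    # token appears only in the candidate
        else:
            out.append(a[i])          # token appears only in the target
            i += 1
    return "".join(out)


def text_length(html):
    """Length of the text once all tags are stripped."""
    return len(re.sub(r"<[^>]+>", "", html))


def simplify(target, candidates):
    """Return the differential with the shortest text over all candidates."""
    results = [differential(target, c) for c in candidates]
    return min(results, key=text_length) if results else target


target = "<html><body><a href='/'>Home</a><p>Pluto probe approved</p></body></html>"
neighbor = "<html><body><a href='/'>Home</a><p>Station decor</p></body></html>"
unrelated = "<html><body><p>Other</p></body></html>"
print(simplify(target, [unrelated, neighbor]))
```

In this toy case the neighbor shares the "Home" link with the target, so its differential is shorter than the one computed against the unrelated page and is selected as the result.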
Automatic insertion of ALT attributes
The system automatically inserts missing alternative text for
image links and image maps by extracting the title of the linked
pages. In this way, a user can recognize linked pages indicated
by image links and image maps without alternative text.
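A minimal sketch of this step for image links follows. It assumes double-quoted `href` attributes and uses a regular expression rather than a full HTML rewrite, so it is not robust to arbitrary markup. `fetch_page` is a hypothetical callable that returns the HTML of the destination page; image maps are not handled here.

```python
import re
from html import escape
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# An <a href="..."> immediately followed by an <img ...> tag.
IMAGE_LINK = re.compile(
    r'(<a\b[^>]*\bhref="([^"]*)"[^>]*>\s*)(<img\b[^>]*>)', re.I)


def insert_missing_alt(html, fetch_page):
    """Add alt text, taken from the linked page's title, to image links."""
    def fix(match):
        prefix, href, img = match.groups()
        if re.search(r"\balt\s*=", img, re.I):
            return match.group(0)          # alt already present
        title = page_title(fetch_page(href))
        if not title:
            return match.group(0)          # nothing useful to insert
        body = img[:-1].rstrip("/").rstrip()
        return prefix + body + ' alt="' + escape(title, quote=True) + '">'
    return IMAGE_LINK.sub(fix, html)


# Hypothetical destination pages standing in for network fetches.
pages = {"/sports.html": "<html><head><title>Sports News</title></head></html>"}
html = '<a href="/sports.html"><img src="s.gif"></a>'
print(insert_missing_alt(html, lambda url: pages.get(url, "")))
```

In a proxy, `fetch_page` would issue an HTTP request for each destination, so some form of caching or a time limit would likely be needed; that concern is outside this sketch.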
Experience-based rules
Finally, the system applies experience-based rules. We will
describe two rules here. One is to move the main content of a
page to the top of the page. For this purpose, the system moves
image maps and forms to the bottom of a page, even though they
often appear at the top of the original page. This function helps
users to access the main content directly. Another rule is to
remove duplicated links, which we often see in a page. For
example, one link is an image link and another is a text link.
Both of them point to the same destination URL, so we decided to
remove one of them.
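The two rules can be sketched with regular expressions as follows. The text does not say which of two duplicated links is removed; preferring the text link over the image link is an assumption made here. The function names are hypothetical, and the patterns assume double-quoted `href` attributes and non-nested forms and maps.

```python
import re

LINK = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>.*?</a>', re.I | re.S)
BLOCK = re.compile(r"<(form|map)\b.*?</\1>", re.I | re.S)


def remove_duplicate_links(html):
    """Keep one link per destination URL, preferring a text link (assumption)."""
    chosen = {}  # href -> (start offset of the kept link, is it a text link?)
    for m in LINK.finditer(html):
        href = m.group(1)
        is_text = "<img" not in m.group(0).lower()
        if href not in chosen or (is_text and not chosen[href][1]):
            chosen[href] = (m.start(), is_text)

    def keep(m):
        return m.group(0) if chosen[m.group(1)][0] == m.start() else ""

    return LINK.sub(keep, html)


def move_forms_and_maps_to_bottom(html):
    """Cut every <form> and <map> block and re-insert it just before </body>."""
    moved = "".join(m.group(0) for m in BLOCK.finditer(html))
    rest = BLOCK.sub("", html)
    end = re.search(r"</body\s*>", rest, re.I)
    if end is None:
        return rest + moved
    return rest[:end.start()] + moved + rest[end.start():]


page = ('<body><form action="/s"><input name="q"></form>'
        '<a href="/n"><img src="n.gif"></a><a href="/n">News</a>'
        '<p>Article</p></body>')
print(move_forms_and_maps_to_bottom(remove_duplicate_links(page)))
```

For the sample page, the image link to `/n` is dropped in favor of the text link, and the search form ends up after the article paragraph.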
Examples of Transcoding
Figure 2 -- Transcoding of ABC News
(a) Original Page (b) Transcoded page
Figure 2 shows pages for an article on a television news Web
site. In the original page of
Figure 2(a), there are 26 links and one form above the article
text in the source HTML file. After simplification, most of them
are removed and only one link is left above the main content. In
this way, a user can easily read today's article without being
bothered by any regularly displayed information. Figure 3 shows
the result page of a search engine. The transcoded version again
removes most of the regularly displayed information that appeared
above the actual results. We have tested our system with many
other newspaper sites and some search engines, both in the U.S.
and Japan, and the results were very similar to these figures.
Most pages can be simplified to about two thirds of the original
file size after being transcoded by our system.
Figure 3 -- Transcoding of Search Engine Results
(a) Original page (b) Transcoded page
Conclusion and plans
Our system can be used on the fly, without advance preparation,
since it can immediately find a page to be compared with on the
network by analyzing neighboring URLs. Our experience-based rules
also help to simplify a page more effectively for blind users.
Through evaluation, it is clear that our transcoder is very
useful both for news sites and for Web search engines, our
initial foci. We conclude that our transcoder will help blind
users access the Web more easily and quickly. We have implemented
the first step towards our goal, and we would like to keep
working in this direction and try to simplify various categories
of Web sites automatically. We also plan to add more rules for
further simplification.
However, we found some limitations in the automatic transcoding.
It sometimes removes too much information or leaves some
information that should be removed. And finally, sometimes blind
users need to read a whole page and understand it even though
they can't see it directly. To solve these problems, we would
like to extend our transcoder's capabilities by using a database
of annotations that could be provided by sighted volunteers. We
hope to keep improving the Web's accessibility with both
automatic and annotation-based transcoding capabilities.
Reprinted with author(s) permission. Author(s) retain copyright.