2001 Conference Proceedings



Transcoding System for the Non-Visual Web Access (1) - Automatic Transcoding -

Hironobu Takagi
takagih@jp.ibm.com 
Chieko Asakawa
chie@jp.ibm.com 
IBM Japan Ltd.
Tokyo Research Laboratory
1623-14, Shimotsuruma
Yamato-shi, Kanagawa-ken 242-8502
Japan

Introduction

The Internet offers a new information resource for blind computer users. On the Web, there are huge amounts of data of various kinds and visitors can get any kind of information from all over the world whenever they want it. In addition, they can communicate with others through e-mail. The Internet has been offering a drastically different type of information resource to blind people. These days, however, Web pages are becoming increasingly visual, using JavaScript, layout tables, frames, images and so on, since Web authors are paying more attention to the visual appearance. This makes many Web pages inaccessible to blind users.

Our research goal is to make the Web a better information resource for the blind, so even a computer novice can easily access the Web. Unfortunately, many of the news sites' visual representations on the Web make non-visual Web access much more difficult. To solve this problem, the Web pages themselves should be simplified and made more easily accessible for blind users. Therefore, we decided to transcode inaccessible pages to be more accessible before they come up on the client side. For the first step in this direction, we focused on news sites and Web search engines that are very popular and often used by any user. Our approach is divided into two components. One is to simplify a Web page by locating the differences between two HTML documents. This enables users to refer only to newly updated information. For example, users can easily track down today's articles. The system can even use the network to find a page to be compared with for differences, even when there is no comparable page in the cache. The other component is based on rules from our experiences with non-visual Web access. For example, one rule is to insert necessary information such as missing alternative texts for image links. Other rules are to move image maps and forms to the bottom of a page and so on. These approaches are for general pages, so they can be used without regard to the language, and any page can be transcoded by our system.

In this paper, we will describe an overview of our system. After showing examples of transcoded pages, we will offer some conclusions and plans for the future.

Automatic transcoding

Architecture

Figure 1 -- Architecture of Automatic Transcoder

Figure 1 shows the system configuration. Users can access the Internet seamlessly by using our system as a proxy server; they only need to change the proxy settings of their browsers. When the proxy server receives an HTTP request from a user, it gets the target HTML file. The simplification module removes the layout components, such as index lists, banners and image maps, from the page. The file is then sent to the Insert ALT module. This module inserts missing alternative texts for each image link or image map by getting the titles of the destination pages.
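As a rough illustration of the flow in Figure 1, the proxy can be thought of as a chain of stages applied to the fetched page. The sketch below is only an assumption about how such a chain might be wired together; the names `transcode`, `fetch` and `Stage` are hypothetical and do not come from our implementation.

```python
from typing import Callable, Iterable

# Hypothetical stage type: takes HTML text and returns transformed HTML text.
Stage = Callable[[str], str]


def transcode(url: str, fetch: Callable[[str], str], stages: Iterable[Stage]) -> str:
    """Fetch the target page, then pass it through each stage in order.

    In terms of Figure 1, `stages` would hold the simplification step
    followed by the Insert ALT step (both supplied by the caller here).
    """
    html = fetch(url)
    for stage in stages:
        html = stage(html)
    return html
```

Passing the stages in as a list keeps the order of the modules explicit and makes it easy to add further steps later.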

Simplification based on calculating the differential

The simplification module simplifies each page by calculating the differential between the target page and its neighboring pages. At first, we implemented the system to use a previous version of the same URL from the cache for comparison with the target file. This allowed the system to simplify pages by keeping only the updated information on each page. However, we found that this method is not useful with newspaper sites or search engines. Since their URLs change every day or with each search, the system usually cannot find a previous file for the same URL in the cache.

For newspaper sites, the URL of an article often includes the date and the month. For example:

http://www.cnn.com/2000/TECH/space/09/22/space.station.decor.ap/index.html

In this URL, "09/22" means September 22nd, and "space.station.decor.ap" is an abbreviated title of the article. This naming scheme means that there is no previous file with the same URL, so there is no obvious comparison file to calculate the differential from.

The result pages of search engines have the same problem. Each URL of a search result includes the keywords of the search. For example:

http://www.go.com/Titles?col=WW&qt=California+State+University

In this URL, "California+State+University" are the retrieval keywords, so each retrieval request has a unique URL. If a previous page for this URL existed, the system could calculate the differential and get the new results, but in general we cannot expect that a previous search with the same retrieval keywords will exist in the cache database.

We needed to develop more general methods for specifying a file to be compared with the target file. The system should be able to get pages for comparison even when there is no previous version of a page. Therefore we developed methods that use not only previous pages, but also neighboring pages. "Neighboring pages" refers to pages that have the same layout, indexes, banners and other frequently appearing elements. We analyzed some major newspaper sites and search engines, and developed the following general rules for listing candidate neighbor pages that may have a similar layout.

1. Pages in the same directory
Example:
Target-> http://www.asahi.com/0423/news/national23008.html
Neighbor-> http://www.asahi.com/0423/news/national23010.html

2. Pages having the same parent directory as the target file
Example:
Target-> http://www.cnn.com/2000/TECH/space/09/22/plutoprobe.ap/index.html
Neighbor-> http://www.cnn.com/2000/TECH/space/09/22/space.station.decor.ap/index.html

3. The index file of each parent directory
Example:
Target-> http://www.cnn.com/2000/TECH/space/09/22/plutoprobe.ap/index.html
Neighbors->
http://www.cnn.com/2000/TECH/space/09/22/index.html
http://www.cnn.com/2000/TECH/space/09/index.html
. . .
http://www.cnn.com/index.html

4. For retrieval results, result pages using different keywords from the cache database
For search engines, the system considers the result pages for other retrieval keywords from the same search engine, since those result pages usually have almost the same layout, except for the actual results and some link lists or banners.
Example:
http://www.go.com/Titles?col=WW&qt=California+State+University
http://www.go.com/Titles?col=WW&qt=assistive+technology+blind

5. Previous pages from the cache database
The system uses not only neighboring pages, but also cached files. The module searches the files in the cache database, and selects any old files for the target URL. These files are added to the candidate list.
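Two of the rules above can be derived from the URL alone, and the following sketch illustrates them. It is a minimal example under stated assumptions, not our implementation: the function names, the default index file name "index.html", and the idea of passing the cache as a plain list of URLs are all hypothetical. Rules 1 and 2 are omitted because they depend on knowing which sibling pages actually exist, which the URL by itself does not reveal.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def parent_index_urls(target: str, index_name: str = "index.html") -> list[str]:
    """Rule 3: the index file of each parent directory, nearest first."""
    parts = urlsplit(target)
    directory = posixpath.dirname(parts.path)
    candidates = []
    while True:
        path = posixpath.join(directory, index_name)
        url = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        if url != target:  # the target may itself be an index file
            candidates.append(url)
        if directory in ("", "/"):
            break
        directory = posixpath.dirname(directory)
    return candidates


def same_engine_results(target: str, cached_urls: list[str]) -> list[str]:
    """Rule 4: cached result pages of the same search engine with other keywords.

    Assumption: "same engine" means same scheme, host and path, with a
    different, non-empty query string.
    """
    t = urlsplit(target)
    found = []
    for url in cached_urls:
        u = urlsplit(url)
        same_page = (u.scheme, u.netloc, u.path) == (t.scheme, t.netloc, t.path)
        if same_page and u.query and u.query != t.query:
            found.append(url)
    return found
```

For the CNN example in rule 3, `parent_index_urls` starts with the index file of the ".../09/22/" directory and ends with the index file of the site root.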
After collecting all the files as a list of candidates by the above methods, the simplification module calculates the differences between the target HTML file and each file in the candidate list. We use the Dynamic Programming matching method (DP matching) to calculate a differential file. However, the differential keeps the tags that are necessary for making it displayable, such as <head>, <body>, <table>, <map>, <script> and so on. After the calculation has finished for every candidate, the system selects the smallest differential file, that is, the one with the shortest text, as the result page.
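The procedure just described can be sketched as follows. This is a simplified illustration, not our implementation: it assumes the page is split into tag and text tokens with a regular expression, uses a longest-common-subsequence table as the DP matching step, and treats the set `KEEP_TAGS` as the tags to retain. All function names are hypothetical.

```python
import re

# Tags retained even when common to both pages (taken from the text above;
# the exact set is an assumption).
KEEP_TAGS = {"html", "head", "body", "table", "map", "script"}
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME_RE = re.compile(r"</?\s*([a-zA-Z0-9]+)")


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def is_kept_tag(token):
    m = TAG_NAME_RE.match(token)
    return bool(m) and m.group(1).lower() in KEEP_TAGS


def differential(target, other):
    """Tokens of `target` that are not matched in `other`, plus kept tags."""
    a, b = tokenize(target), tokenize(other)
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    out, i, j = [], 0, 0
    while i < n:
        if j < m and a[i] == b[j]:
            if is_kept_tag(a[i]):  # common token: drop unless structural
                out.append(a[i])
            i, j = i + 1, j + 1
        elif j < m and lcs[i][j + 1] > lcs[i + 1][j]:
            j += 1  # skip a token that appears only in the other page
        else:
            out.append(a[i])  # token unique to the target page
            i += 1
    return out


def text_length(tokens):
    return sum(len(t) for t in tokens if not t.startswith("<"))


def simplify(target, candidates):
    """Return the differential with the shortest text over all candidates."""
    results = [differential(target, c) for c in candidates]
    return min(results, key=text_length) if results else tokenize(target)
```

A candidate that shares more of the target's layout leaves less text in the differential, which is why choosing the shortest result tends to pick the closest neighbor.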

Automatic insertion of ALT attributes

The system automatically inserts missing alternative text for image links and image maps by extracting the title of the linked pages. In this way, a user can recognize linked pages indicated by image links and image maps without alternative text.
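A minimal sketch of this step is shown below, using Python's standard `html.parser` module. It is an illustration under assumptions rather than our implementation: the class and function names are hypothetical, and `title_of` stands for any callable that maps a link destination to the title of that page (for example by fetching it), returning `None` when no title is available.

```python
from html import escape
from html.parser import HTMLParser


class AltInserter(HTMLParser):
    """Re-emits the page, adding alt text to image links and map areas."""

    def __init__(self, title_of):
        super().__init__(convert_charrefs=False)
        self.title_of = title_of  # hypothetical: URL -> page title or None
        self.out = []
        self.href = None  # destination of the link we are currently inside

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        attrs = dict(attrs)
        if tag == "a":
            self.href = attrs.get("href")
        destination = None
        if tag == "img" and "alt" not in attrs:
            destination = self.href  # image used as a link
        elif tag == "area" and "alt" not in attrs:
            destination = attrs.get("href")  # region of an image map
        title = self.title_of(destination) if destination else None
        if title:
            close = "/>" if raw.endswith("/>") else ">"
            raw = raw[: -len(close)].rstrip() + f' alt="{escape(title)}"' + close
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def insert_alt(html, title_of):
    parser = AltInserter(title_of)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Images that already carry an alt attribute are left untouched, so text supplied by the page author is never overwritten.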

Experience-based rules

Finally, the system applies experience-based rules. We will describe two rules here. One is to move the main content of a page to the top of the page. For this purpose, the system moves image maps and forms to the bottom of a page, even though they often appear at the top of the original page. This helps users to access the main content directly. Another rule is to remove duplicated links. We often see duplicated links in a page. For example, one link is an image link and another is a text link. Both of them point to the same URL, so we decided to remove one of them.
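The two rules can be illustrated with a small regular-expression sketch. It rests on several assumptions that the text does not settle: image maps are taken to be `<map>` elements, the moved blocks are placed just before `</body>`, and when an image link and a text link share a URL the text link is the one kept. The function names are hypothetical, and a production system would use a real HTML parser rather than regular expressions.

```python
import re

BLOCK_RE = re.compile(r"<(form|map)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a\s*>',
                     re.IGNORECASE | re.DOTALL)


def move_forms_and_maps_to_bottom(html):
    """Cut every form and map element and re-insert it before </body>."""
    moved = "".join(m.group(0) for m in BLOCK_RE.finditer(html))
    remaining = BLOCK_RE.sub("", html)
    idx = remaining.lower().rfind("</body>")
    if idx == -1:
        return remaining + moved
    return remaining[:idx] + moved + remaining[idx:]


def remove_duplicate_links(html):
    """Keep one link per URL, preferring a link that has visible text."""
    keep = {}  # href -> (index of the link to keep, whether it has text)
    matches = list(LINK_RE.finditer(html))
    for i, m in enumerate(matches):
        href = m.group(1)
        has_text = bool(re.sub(r"<[^>]+>", "", m.group(2)).strip())
        if href not in keep or (has_text and not keep[href][1]):
            keep[href] = (i, has_text)
    counter = iter(range(len(matches)))

    def replace(m):
        i = next(counter)  # sub() visits matches in the same order as finditer()
        return m.group(0) if keep[m.group(1)][0] == i else ""

    return LINK_RE.sub(replace, html)
```

Preferring the text link is a choice made for this sketch, since a text link can be read aloud without any alternative text.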

Examples of Transcoding

Figure 2 -- Transcoding of ABC News
(a) Original page (b) Transcoded page

Figure 2 shows pages for an article on a television news Web site. In the original page of Figure 2(a), there are 26 links and one form above the article text in the source HTML file. After simplification, most of them are removed and only one link is left above the main content. In this way, a user can easily read today's article without being bothered by any regularly displayed information. Figure 3 shows the result page of a search engine. The transcoded version again removes most of the regularly displayed information that appeared above the actual results. We have tested our system with many other newspaper sites and some search engines, both in the U.S. and Japan, and the results were very similar to these figures. Most pages can be simplified to about two thirds of the original file size after being transcoded by our system.

Figure 3 -- Transcoding of Search Engine Results
(a) Original page (b) Transcoded page

Conclusion and plans

Our system can be used on the fly, without advance preparation, since it can immediately find a page to be compared with on the network by analyzing neighboring URLs. Our experience-based rules also help to simplify a page more effectively for blind users. Through evaluation, it is clear that our transcoder is very useful both for news sites and for Web search engines, our initial foci. We conclude that our transcoder will help blind users access the Web more easily and quickly. We have implemented the first step towards our goal, and we would like to keep working in this direction and try to simplify various categories of Web sites automatically. We also plan to add more rules for further simplification.

However, we found some limitations in the automatic transcoding. It sometimes removes too much information or leaves some information that should be removed. And finally, sometimes blind users need to read a whole page and understand it even though they can't see it directly. To solve these problems, we would like to extend our transcoder's capabilities by using a database of annotations that could be provided by sighted volunteers. We hope to keep improving the Web's accessibility with both automatic and annotation-based transcoding capabilities.




Reprinted with author(s) permission. Author(s) retain copyright.