2004 Conference Proceedings



USING WEB SITE INTERCONNECTIVITY TO FIND CLUSTERS OF ACCESSIBILITY PROBLEMS

Presenter(s)
P. Matthew Bronstad
John Slatin
Accessibility Institute and Department of Psychology
The University of Texas at Austin
Email: bmatt@mail.utexas.edu

1. ABSTRACT

Automated tools such as Bobby (2002) help make Web pages accessible to people with disabilities. They typically work by inspecting Web pages' source code for clear violations of accessibility guidelines. Although these tools often use crawlers to find Web pages within a site, they do not make use of site structure to help webmasters make repairs. This study finds that a Web page's errors can be predicted from the errors on a connecting Web page or on a Web page with a similar URL. Further, the most inaccessible pages tend to cluster (that is, they are highly interconnected) on Web sites. These findings suggest that knowledge of Web page connectivity is useful for making Web sites more accessible.

2. INTRODUCTION

Web sites are made up of Web pages that are similar to each other. The similarities exist for many reasons: one individual may have created every page on the site, different authors may have borrowed code from one another, or many authors may have used one template to create many pages. In fact, a respected usability guideline is that pages on a Web site should have a similar look and feel.

At the Accessibility Institute (formerly ITAL) at The University of Texas at Austin, we have known that checking the accessibility of a Web site's home page indicates problems one might find on pages deeper within the site. If, on the home page, alt text is not present, text boxes are not labeled, tables do not have headers, and so on, we usually find that the problem is endemic.

The Accessibility Institute's mission is to make UT Web more accessible to people with disabilities. We face several problems. First, UT Web is massive. Current estimates are that the Web site has several million pages. We cannot possibly make millions of pages accessible ourselves. Second, UT Web is sprawling, with Web sites representing many diverse academic departments, administrative offices, and institutes. We cannot make changes to Web pages without permission from webmasters and their supervisors. Accordingly, we want to make our efforts maximally effective and to encourage those who have authority over subsites to work with us to make sites more accessible. In principle, any individual or company has limited time and resources and is therefore in a similar circumstance.

Two observations led us to think that the pattern of Web site connectivity is potentially useful. First, pages on a site seem to be similar to each other. Second, despite the "messiness" of UT Web as a whole, it has hierarchical structure as well as hidden structure (e.g., UT webmasters borrow code from other UT webmasters, and some UT webmasters code several distinct sites). Google, for example, found that information value is extractable from patterns of connectivity on the internet (Page, Brin, Motwani, & Winograd, 1998). It is possible that clusters of accessibility problems could likewise be extracted from Web site connectivity.

3. METHODS

Hypotheses

  1. Accessibility errors are predictable from errors on connecting pages. Similarity of errors should decrease as distance between pages increases.
  2. If the first hypothesis is true, then problematic web pages should cluster on sites.
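
The "distance between pages" in the first hypothesis can be illustrated with a breadth-first search over a site's link graph. This is only a minimal sketch, not the crawler used in the study: the function name link_distances, the dictionary format for links, and the choice to treat links as undirected are all assumptions made for illustration, and the four-page site is hypothetical.

```python
from collections import deque


def link_distances(links, start):
    """Return {page: fewest links needed to reach it from `start`}."""
    # Assumption: a link in either direction makes two pages neighbors.
    neighbors = {}
    for source, targets in links.items():
        for target in targets:
            neighbors.setdefault(source, set()).add(target)
            neighbors.setdefault(target, set()).add(source)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in neighbors.get(page, ()):
            if nxt not in distances:
                # First visit in a breadth-first search is the shortest path.
                distances[nxt] = distances[page] + 1
                queue.append(nxt)
    return distances


# Hypothetical four-page site: page -> pages it links to.
site_links = {
    "/": ["/about", "/research"],
    "/research": ["/research/labs"],
}
distances_from_home = link_distances(site_links, "/")
```

Under this reading, "/about" and "/research" are one link from the home page and "/research/labs" is two links away, so the hypothesis predicts that the home page's error report resembles the first two more than the third.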

Sites
Web sites were chosen to represent diversity in site organization, purpose, and audience. We selected 11 commercial, corporate, and academic sites.

Measures

  1. Web site interconnectivity: A crawler recorded connections among pages in Web sites.
  2. Accessibility errors: Bobby (http://www.watchfire.com) generated automated accessibility reports for pages identified during the crawls. The smallest number of pages analyzed on a single site was 80, the largest 1037. To lower the dimensionality of the data, similarities were calculated between each page's entire error report and the entire error report of every other page within the same site.
  3. URL name similarity: the degree of similarity among URL names was determined by calculating the proportion of common characters between two URLs up to the point they diverged. For example, not counting the "http://" prefix, "http://www.utexas.edu/research" and "http://www.utexas.edu/search/a" diverge after 15 characters and thus are 15/23 (65%) identical.
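
The URL name similarity measure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function name url_similarity is hypothetical, the "http://" prefix is dropped because that is what the worked example's character counts imply, and dividing by the length of the longer URL is an assumption (both example URLs are 23 characters long, so the example does not settle that choice).

```python
def url_similarity(url_a, url_b):
    """Proportion of leading characters two URLs share before they diverge."""

    def strip_scheme(url):
        # Drop "http://" (or any other scheme) so counting starts at the host.
        return url.split("://", 1)[-1]

    a, b = strip_scheme(url_a), strip_scheme(url_b)
    # Assumption: divide by the length of the longer URL.
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0

    shared = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        shared += 1
    return shared / longest


example = url_similarity(
    "http://www.utexas.edu/research",
    "http://www.utexas.edu/search/a",
)
print(round(example, 2))  # 0.65, i.e. 15 shared characters out of 23
```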

Analyses

Error similarity was predicted using analysis of covariance (ANCOVA) on subsamples of page comparisons within Web sites (e.g., 2000 to 10000 comparisons). Analysis of problematic page clusters included three steps:

  1. Selecting pages that exceeded a particular number of errors. The criterion for selection was adjusted in five steps, from an average of 80% of pages on a site to an average of 5% of the pages selected from a Web site.
  2. Determining the number and size of clusters in those problematic pages (two pages were assigned to a single cluster if one linked to the other).
  3. Comparing the largest error cluster to a distribution of maximum cluster sizes from random samples of pages from the same site. Each random sample was the same size as the number of problematic pages identified in step 1.
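
The three steps above can be sketched as follows. This is a hedged illustration, not the study's implementation: the function names, the dictionary inputs (error_counts and links), the default of 1000 random trials, and the fixed seed are all assumptions, and the six-page site is hypothetical. Clusters are formed as described in step 2, by joining two selected pages whenever one links to the other.

```python
import random


def largest_cluster(pages, links):
    """Size of the largest group of selected pages joined by links."""
    selected = set(pages)
    neighbors = {page: set() for page in selected}
    for source, targets in links.items():
        for target in targets:
            # Join two selected pages if one links to the other.
            if source in selected and target in selected and source != target:
                neighbors[source].add(target)
                neighbors[target].add(source)

    seen, best = set(), 0
    for page in selected:
        if page in seen:
            continue
        seen.add(page)
        stack, size = [page], 0
        while stack:  # depth-first walk over one cluster
            current = stack.pop()
            size += 1
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def compare_to_random(error_counts, links, threshold, trials=1000, seed=0):
    """Return (observed largest cluster, share of random samples it exceeds)."""
    rng = random.Random(seed)
    all_pages = sorted(error_counts)
    # Step 1: select pages whose error count exceeds the criterion.
    problematic = [p for p in all_pages if error_counts[p] > threshold]
    # Step 2: largest cluster among the problematic pages.
    observed = largest_cluster(problematic, links)
    # Step 3: largest clusters in same-sized random samples from the site.
    random_sizes = [
        largest_cluster(rng.sample(all_pages, len(problematic)), links)
        for _ in range(trials)
    ]
    exceeded = sum(size < observed for size in random_sizes)
    return observed, exceeded / trials


# Hypothetical site: three error-heavy pages linked in a chain, three isolated.
error_counts = {"a": 9, "b": 8, "c": 7, "d": 1, "e": 0, "f": 2}
links = {"a": ["b"], "b": ["c"]}
observed, share = compare_to_random(error_counts, links, threshold=5)
```

A share above 0.95 would correspond to the stricter comparison of problematic pages against 95% of the largest clusters formed by randomly selected pages.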

4. RESULTS SUMMARY

Both connection distance and URL name similarity predicted error report similarity. Additionally, in 8 of 11 sites, pages one link apart had more similar accessibility reports than pages two links apart.

Problematic pages formed larger clusters (i.e., were more interconnected) than one would expect based on the site structure. Out of 54 performable analyses (11 sites by 5 criteria, less one condition), in 48 cases (89%) the problematic pages formed larger clusters than did the randomly selected pages. In 21 of the 54 analyses (39%), the problematic pages formed larger clusters than 95% of the largest clusters formed by randomly selected pages. Moreover, as the error criterion was made stricter, it became more probable that the problematic pages would form larger clusters than randomly selected pages.

5. CONCLUSION

First, within a Web site, a Web page's accessibility errors can be predicted from an adjacent page's error report. Second, the pages with the worst errors tend to cluster on Web sites. If adjacent pages are similar in their types of errors, then connected pages (clusters) that have the worst problems should have similar errors. A webmaster faced with a large Web site could use similar analyses as guidance to find starting points for Web site repair.

A reasonable way to begin maintenance on any type of system is to find the error source that creates the largest number of problems. Assuming that all problems are equally easy to repair, fixing the most pervasive errors should give the best return on the time invested. The results of this study indicate that it is possible to construct mechanisms that direct attention to the parts of Web sites most in need of repair. Such strategies could help make the most of the effort spent reducing accessibility problems on large Web sites.

6. REFERENCES

Bobby 4.0.1 [computer software]. (2002). Waltham, MA: Watchfire.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Stanford Digital Libraries Working Paper.




Reprinted with author(s) permission. Author(s) retain copyright.