Search and Taxonomy
CTC at the US Department of State
Providing expertise in Search and Taxonomy
More than 200 Websites managed by the IIP/Web office are now powered by state-of-the-art Google search technology. Behind the simple yet magical search box, maintaining search services for each of these Websites is like running a small orchestra.
Constant coordination is required among the content management/search team, the content owners, the site administrators and the product vendor. Many perspectives must be taken into account to support a world of dynamic content effectively. The recent success story of making treaty PDFs searchable is an interesting one to share.
The Office of the Assistant Legal Adviser for Treaty Affairs serves as the principal U.S. government repository for U.S. treaties and other international agreements. These treaties and agreements are published in PDF format. They can be browsed in a dedicated section of the State Department's public Website, alongside HTML landing pages.
However, as with most Websites, the physical structure in which the treaty Web content is stored favors navigation rather than search indexing. While the HTML pages are physically grouped together, the PDF documents reside in a separate, very large directory, intermixed with PDFs on other subject matters.
The search team understood that making these PDFs searchable was critical. It was important to develop a solution that not only solved the problem at hand with minimal system impact but could also be maintained in the long term with existing resources.
Working closely with the customer's users, the site administrator and the search vendor, the team investigated seven different approaches. After many hours of prototyping, weighing pros and cons, and following multiple long email trails, the team identified the best solution. The selected implementation makes creative use of the virtual path concept to logically extract all treaty PDFs. This approach eliminates the need to physically change the existing Web link structure and therefore minimizes the impact on the existing system. It also maintains the integrity of the search service architecture by setting up the process on the content server rather than force-fitting the search servers to one unique situation. The PDFs are then included in the treaty search index together with all treaty HTML Web pages. The end result is a much more powerful search tool that the public can use to search for treaty information in different formats.
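The virtual path idea can be illustrated with a minimal sketch. It is not the team's actual implementation, which the write-up does not detail. It assumes that treaty PDFs can be recognized by a known set of document identifiers, and the directory names, the identifiers, and the functions `build_virtual_map` and `resolve` are all hypothetical. The point it shows is that the files stay where they are, while a separate logical path groups only the treaty PDFs for the indexer.

```python
from pathlib import PurePosixPath

# Hypothetical virtual root under which treaty PDFs are exposed to the indexer.
VIRTUAL_TREATY_ROOT = PurePosixPath("/treaties/pdf")


def build_virtual_map(physical_paths, treaty_ids):
    """Map a virtual treaty path to each treaty PDF's unchanged physical path.

    physical_paths: paths of all files in the shared PDF directory.
    treaty_ids: file stems known to be treaties (an assumed way to identify them).
    """
    mapping = {}
    for raw in physical_paths:
        path = PurePosixPath(raw)
        # Keep only PDFs whose stem is a known treaty identifier.
        if path.suffix.lower() == ".pdf" and path.stem in treaty_ids:
            mapping[str(VIRTUAL_TREATY_ROOT / path.name)] = str(path)
    return mapping


def resolve(virtual_path, mapping):
    """Return the physical path behind a virtual path, or None if unmapped."""
    return mapping.get(virtual_path)
```

With a mapping like this kept on the content server, a crawler pointed at the virtual root would see only treaty PDFs, and existing links to the physical files would keep working.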