Hi Dennis. I've got a small problem that's driving some of my members and myself nuts. Certain feeds (which happen to be my most important ones) have this problem that in some cases makes the post almost completely unreadable. Is there a way to strip those out of there? Here's a small example of part of one so you see what I mean.
Quote: The discussion and email about desktop search offered an opportunity for us to have a deeper architectural discussion about engineering Windows 7. There were a number of comments suggesting alternate implementation methods so we thought we’d discuss another approach and the various pros and cons associated with it. It offers a good example of the engineering balance we are striving for with Windows 7. Chris McConnell wrote this follow-up. --Steven (See you at the PDC in a week!)
Thanks for all the great feedback on our first blog post on Windows Desktop Search. I’ve summarized a number of points that have been made and added some comments about the architectural choices we have made and why.
Integration with the File System
As some posters have pointed out, one possible implementation is to integrate indexing with the file system so that updating a file immediately updates the indices. Windows Desktop Search takes a different approach. There are two aspects of file system integration: knowing when a file changes and actually updating the indices before a file is considered “closed” and available. On an NTFS file system, the indexer is notified whenever a file changes. The indexer never scans the NTFS file system except during the initial index. It is on the second point, updating the indices immediately when a file is closed, that we made a different choice. Updating immediately has the benefit that a file is not available until it is indexed, but it also comes with a number of potential disadvantages. We chose to decouple indexing from file system operations because it allows for more flexibility while still being almost real-time. Here are some of the benefits we see in the approach we took:
Fewer resources are used. Inverted indices are global. An inverted index maps from a word found in a property to a list of every document that contains that word. Indexing a single file requires updating an index for every single unique word found in the file. A single document might then update a very large number of individual indices. Making these changes and committing them with the same robustness found on individual files would be very expensive. The design of the indexer allows scheduling and aggregating these changes so that much less work is done overall—that means less CPU and less disk I/O. The system can be more robust because indexing doesn’t only happen when a file is closed—and it can even be retried if necessary.
File system operations are prioritized over indexing. Getting files robustly updated and available is necessary for applications to use them. We don’t want to delay that availability by forcing the cost of indexing into file close operations. Searching over files is important, but is less important than actually working with files. We wouldn’t want applications to decide individually if the indexer should be turned on or off just because they were seeking the best performance with respect to the file system.
There are lots of file types. Microsoft supplies extractors (IFilter/IPropertyHandler) for many common file types as part of Windows. There are many other file types as well, so it is important to allow non-Microsoft developers to write their own extractors. In Vista (and Windows 7), these extractors run in a locked-down process that ensures that they are secure and do not affect the performance of the whole system. If indexing had to happen before a file was available, then an extractor could impact (intentionally or not) all file system operations.
Some files are more valuable to index than others. If indexing happened when a file is closed, then there would be no control over the order in which files are indexed. Decoupling allows prioritizing indexing some files over others. For example, searching for music is much more likely than searching for binary files. If both music files and binary files have changed, then the indexer ensures it indexes the music files first. Some files are not worth indexing at all for most people. Several comments suggested that we should index the whole drive. We can do that, and for those who would find it valuable it is easy to add folders to be indexed. (You can also remove them, but that is much less common so that is controlled through the control panel “Indexing Options.”) For most people indexing system files is just a cost: they would never search for them and would be confused if they showed up as the result of a search.
Not everything is a file in a single file system. Windows is all about supporting diversity. There are many different file systems like FAT32 and CDFS and we would like to be able to search over those as well. If we integrated with only NTFS, then we would still have to have a loosely coupled system for other file systems. Many applications also have databases optimized for their own needs. For example, Outlook has a database of email. If only files were indexed, then the email in the database could not be indexed unless Outlook either compromised its experience by using files only, or complicated its implementation by duplicating everything in both the file system and the database.
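
For anyone who wants to see the decoupling idea in miniature, here is a rough Python sketch. It is not how Windows Desktop Search is implemented. The class name `DecoupledIndexer`, the `PRIORITY` table, and the whitespace word splitting are all assumptions made up for illustration. It only shows the three ideas from the quoted list: a change notification is cheap, repeated saves of the same file collapse into one indexing job, and queued files are indexed in priority order (music before binaries) into an inverted index.

```python
import heapq
from collections import defaultdict

# Hypothetical priorities for illustration: a lower number is indexed first.
PRIORITY = {".mp3": 0, ".txt": 1, ".bin": 9}


class DecoupledIndexer:
    """Toy model: file changes are queued, then indexed later in batches."""

    def __init__(self):
        self.pending = []              # heap of (priority, sequence, path)
        self.latest = {}               # path -> most recent text awaiting indexing
        self.index = defaultdict(set)  # inverted index: word -> set of paths
        self.words_of = {}             # path -> words currently indexed for it
        self.seq = 0                   # tie-breaker that keeps arrival order

    def notify_changed(self, path, text):
        """Called when a file is closed; does no indexing work itself."""
        if path not in self.latest:
            ext = path[path.rfind("."):] if "." in path else ""
            heapq.heappush(self.pending, (PRIORITY.get(ext, 5), self.seq, path))
            self.seq += 1
        self.latest[path] = text       # repeated saves collapse into one job

    def run_batch(self, limit):
        """Index up to `limit` queued files, highest priority first."""
        done = []
        while self.pending and len(done) < limit:
            _, _, path = heapq.heappop(self.pending)
            new_words = set(self.latest.pop(path).lower().split())
            # Remove the path from words it no longer contains.
            for word in self.words_of.get(path, set()) - new_words:
                self.index[word].discard(path)
            for word in new_words:
                self.index[word].add(path)
            self.words_of[path] = new_words
            done.append(path)
        return done

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


indexer = DecoupledIndexer()
indexer.notify_changed("setup.bin", "binary blob")
indexer.notify_changed("notes.txt", "draft one")
indexer.notify_changed("notes.txt", "final draft")   # overwrites the queued text
indexer.notify_changed("song.mp3", "blue draft")
first = indexer.run_batch(limit=2)                   # music, then text
```

After the first batch the binary file is still waiting in the queue, so a search for "binary" finds nothing until a later `run_batch` call picks it up. That delay is the "almost real-time" trade-off the quoted post describes.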