Avoid results from same domain

  • Avoid results from same domain

    Hi, I was wondering if there is a way to avoid having all results come from the same domain. Most people only look at the first 10 results and rarely go to the second page.

    When someone searches for a single word instead of a phrase, almost all the results come from the same domain. I have over 3,000 domains in the database, yet every search displays only one or two domains. I would like it to display more domains in each search, for example by limiting the output to 2 results per domain.

    Also, is there a way to limit the number of results per search to improve speed? Let's say I only want to show 50 results per search; there is no need for the engine to output 3,000 or 6,000 results on each search, as nobody uses them. Logically it should be fine to output only 50, which would increase speed even more. Of course, internally it would still have to search the full database.

    Thanks

  • #2
    Originally posted by nibb
    When someone searches for a single word instead of a phrase, almost all the results come from the same domain. I have over 3,000 domains in the database, yet every search displays only one or two domains. I would like it to display more domains in each search, for example by limiting the output to 2 results per domain.
    Zoom does not currently do this, although it has been requested a few times and we're considering it for a future version (possibly V7).
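
    In the meantime, if you are post-processing the results yourself, capping hits per domain is a small loop. Here is a rough sketch in TypeScript; the SearchResult shape is invented for illustration and this is not an existing Zoom API:

    Code:
    interface SearchResult {
      url: string;
      title: string;
    }

    // Keep at most `maxPerDomain` results per hostname, preserving rank order.
    function capPerDomain(results: SearchResult[], maxPerDomain = 2): SearchResult[] {
      const counts = new Map<string, number>();
      const kept: SearchResult[] = [];
      for (const result of results) {
        const domain = new URL(result.url).hostname;
        const seen = counts.get(domain) ?? 0;
        if (seen < maxPerDomain) {
          kept.push(result);
          counts.set(domain, seen + 1);
        }
      }
      return kept; // results beyond the cap are dropped (they could be demoted instead)
    }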

    Originally posted by nibb
    Also, is there a way to limit the number of results per search to improve speed? Let's say I only want to show 50 results per search; there is no need for the engine to output 3,000 or 6,000 results on each search, as nobody uses them. Logically it should be fine to output only 50, which would increase speed even more. Of course, internally it would still have to search the full database.
    By default, Zoom would not be displaying 3,000 results. As you noted, internally it would still have to search the entire collection, so it may have determined that there are 3,000 matches... but this does not necessarily mean there's any extraneous processing here.

    What exactly do you mean by "output 3,000 or 6,000 results"? If you are actually displaying more than 50 results on the search page, then yes, you can reduce that number to save a lot of work. But the default is 10 results per page, and this can be controlled with the dropdown on the search form.

    The processing work involved in determining the number of results found is insignificant, and it is part of searching through the full database anyway.

    There is an Optimization control (under "Configure" -> "Limits" in the Indexer). When it is pushed up to "Fastest (least accurate)", it performs a limited number of seeks and matches before telling the user something like, "The search words you are looking for are too common, please try something else".
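
    Conceptually, that setting is just an early exit with a fixed work budget. A sketch of the idea only (TypeScript, with invented names; this is not Zoom's actual internals):

    Code:
    // "Fastest (least accurate)" mode: stop once a fixed budget of candidate
    // documents has been examined, and flag the truncation so the front end
    // can tell the user their search terms are too common.
    function searchWithBudget(
      candidates: Iterable<number>,          // doc IDs from an index lookup
      matches: (docId: number) => boolean,   // full match test per document
      budget = 10_000
    ): { hits: number[]; truncated: boolean } {
      const hits: number[] = [];
      let examined = 0;
      for (const docId of candidates) {
        if (++examined > budget) return { hits, truncated: true };
        if (matches(docId)) hits.push(docId);
      }
      return { hits, truncated: false };
    }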

    Having said all that, V7 will feature a new "Max. results per query" limit (currently 1,000, the same as Google's). Changing it would prevent the search function from returning results beyond the specified limit. This doesn't help much with efficiency; it was a custom development feature for a user who wanted to request more than 1,000 results in XML format for post-processing.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      If the limit is 1,000 and reducing it will not improve things, then it really doesn't matter, as shaving a few milliseconds off wouldn't be any real saving.

      The thing is, I use Zoom V5 in a different way than most people. I use it to scan a niche of specific websites which are not controlled by me, so I cannot control which ones get higher priority by putting tags on them.

      And I cannot tell people to just search for something else, as it doesn't quite work that way. If someone searches for "Australia", for example, I want 10 websites related to Australia to show up, not 10 results from the same domain. I only care about the first 10 to 30 results. One solution would be to limit the spider to only the top page of each domain and not go deeper, but that would leave the database poor and without many results. I would rather hit the 250,000 unique word limit it currently has, yet even so, almost no search seems to mix domains or at least give users options. If all 20 results are from the same domain, or from only two, then they are not really 20 results but only 2, since people search for websites, not web pages. Google does the same: it will never show the same domain or URL twice in the same query, which is how I want it to work, so people can choose from an assortment of results.

      I know most people use it internally, but giving a little priority to people who use it to spider external sites would be great. I'm not saying to put a PageRank system in it (even if the formula itself would be quite easy), but how about a way to track which results people click, or some other way to get more diversification on the results page?

      The only idea I had was to manually improve the results, but for that there is only the suggestion feature, which would work except for these problems:

      a) I would need to add a lot, and I mean a lot, of suggestions; for example, for at least the top 500 queries I would suggest the best 3 URLs. This would be a complete mess to handle, as it would be too big to manage in the Zoom interface; the suggestion feature seems to be made for ads rather than for manually improving results. It would be scary to handle over 100 suggestions without losing yourself.

      b) It would require a complete reindex for each new suggestion added. As my scan takes at least a few hours on a 100 Mbit connection on a dedicated server, this makes the feature impossible to use for what I want; I would need to keep adding suggestions all the time, and doing a rescan for each URL added as a suggestion would be too time-consuming.

      If I could add or modify the priority of some URLs for some queries without rescanning, that would be the solution; or I could just improve the overall results to be more mixed. Right now it's like a little anarchy database where the results come out as they please, without me having any control over which results I would like to rank higher for specific words.

      The suggestion to be more specific with queries would just mean Zoom doesn't work for searches of fewer than two words, or even one word. As far as I know, phrase searching should be optional, not a requirement. My idea was to deploy the node option once it can handle over 500,000 unique words, or at least over 4 million pages, but what's the point of having a huge database if the query results are always the same for different keywords? It seems the bigger the database gets, the less accurate the results get. Being able to limit the results per domain would probably give the impression that the results are richer.



      • #4
        We understand the situation, and as we said before, we are considering adding a feature to group results by domain.

        As you noted, Zoom was originally designed for internal searching (i.e. sites that you own or maintain), but with its increased capabilities, it has become more and more popular for indexing many external sites. This is one feature that would cater to such usage.

        The other features I mentioned were just related options that are already available, in response to what you asked. I agree they won't help your primary problem, but you asked about limiting results to improve search time, and I was explaining that limiting features exist; the only type of limiting that would improve search time, however, is cutting back on accuracy.

        You mentioned you're using V5 (please note you're posting in the V6 forum). There are some V6 features which already cater to this usage that you might not be aware of. I will elaborate on them below.

        Originally posted by nibb
        I'm not saying to put a PageRank system in it (even if the formula itself would be quite easy), but how about a way to track which results people click, or some other way to get more diversification on the results page?
        Tracking clicks would place more requirements on the server. The clicks would need to be stored in either a log file or a database in real time (per query). This means a more complicated setup procedure and more demanding technical requirements in terms of permissions and resources.

        While it would be a nice thing to have, the technical requirements would restrict its use to a small portion of our user base, and thus it's not something we consider a practical addition at this point.

        Originally posted by nibb
        The only idea I had was to manually improve the results, but for that there is only the suggestion feature, which would work except for these problems:
        Manually manipulating the results for over 500 queries is not really a solution I would personally consider.

        Instead, you might want to take a look at the following:
        • V6 allows you to specify a different Weighting value per start point. Assuming you have one start point per domain, you can then lower the weighting of sites which are swarming the results so they are ranked lower (and only appear near the top if they truly match significantly).
        • Consider reducing the number of pages you index from each site. For each start point you can specify "Limit files for this start point". So you could index just 10 pages from a site which would otherwise overwhelm your index.
        • You could set a global "Limit files per start point" setting.
        • You could set a global "Limit words per file" setting, to index only the top portion of each page. This can reduce the impact of sites with really long pages, which, again, swarm the results.
        • You can change the Weighting for "Content density" to "Strong adjustment", giving preference to smaller documents over larger ones (again reducing "swarming").
        • Other weighting options like "Word position" and "URL length" may also help reduce swarming by increasing the impact of other factors.
        Originally posted by nibb
        The suggestion to be more specific with queries would just mean Zoom doesn't work for searches of fewer than two words, or even one word.
        No, the Optimization setting just eliminates extraneous searching (and unnecessary server load) when the user is not going to get good results anyway. When they search for words like "the what is" (assuming they are not skip words), there are going to be too many results for the result set to be meaningful, no matter the search algorithm. It will still return the results it has found; it just decides that the rest of the results aren't going to be much better. All it asks is that the user be more specific in their choice of terms so that the results are more finite and manageable.

        Again, I'm not suggesting it as a solution for improving your results across multiple domains; it was an answer to your specific question about limiting results and matches.

        Originally posted by nibb
        It seems the bigger the database gets, the less accurate the results get.
        Not true. Accuracy is exactly the same with bigger databases as it is with smaller ones. But the more data there is to search, the longer it takes; that's just a fact. You asked if there was a way to cut that time by reducing the work required. Naturally, we have already optimized everything we can without losing accuracy in normal use. So the only area left is giving you control to cut back the actual searching performed, which is why I mentioned those features.

        What search times are you getting? Are you using the CGI version? What is your server hardware? How many pages are in your index?
        Last edited by Ray; Jul-06-2010, 03:53 AM.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          I get 0.008 milliseconds per search, or something similar. I'm not complaining about speed; the CGI version works great on a 12 GB machine with 8 CPUs. I have other projects on it, so it's not fully dedicated, as I'm just testing. The problem is not speed but the results, which are rather overwhelming. Maybe a ranking algorithm like this:
          http://orbitscripts.com/orbit-web-spider-features.html
          Would really help the results, or maybe it's really just not meant for external search. To be honest, I have not tested the weighting and boosting preferences much yet. I really think that with enough patience and several spidering tests with different settings, I could achieve what I want. It's just that I would maybe need to test with a small database, as I need to rescan whenever I change settings or preferences. The problem with limiting files per domain is that some domains are really rich in content and others are not, so I cannot use a global setting here; I want to scan as many internal pages as possible on each domain, or as many as my hardware supports.

          About tracking clicks on search results, I don't agree with you. It's very easy to do with JavaScript; you can even track mouse movement with JavaScript, and I was only suggesting recording which result position was clicked. JavaScript works on the browser side, and I don't think sending a bit of data (to store the result) will affect performance at all; it's like saying that loading an extra GIF button will increase server load. The data sent is almost irrelevant, as it's only one or two digits. Nothing that can't be done with some AJAX.
          Lastly, my idea was (though I'm not sure if it's possible) to have multiple databases: one with domains that don't have good results, where I lower their priority, and another with a list of priority domains which I need to scan at 100%, meaning the full websites. I'm just not sure if combining multiple databases when searching would work.



          • #6
            Originally posted by nibb
            I get 0.008 milliseconds per search, or something similar. I'm not complaining about speed; the CGI version works great on a 12 GB machine with 8 CPUs. I have other projects on it, so it's not fully dedicated, as I'm just testing. The problem is not speed but the results, which are rather overwhelming.
            OK. It's just confusing because this was all brought on by your initial question:

            Originally posted by nibb
            Also, is there a way to limit the number of results per search to improve speed?
            It's quite hard to write a meaningful answer to your question when you then tell us you don't need what you asked for.

            Originally posted by nibb
            Maybe a ranking algorithm like this:
            http://orbitscripts.com/orbit-web-spider-features.html
            Would really help the results, or maybe it's really just not meant for external search.
            I do not see anything mentioned in their search algorithm that we do not already do. I would really suggest looking harder at the options I pointed out. We do use backlinks in our V6 ranking algorithm. Again, if you're using V5, you're missing out on all of that.

            I did look at their search demo, and they do group results by domain, which is one thing we don't do yet.

            However, we don't require a dedicated server. And with their "request price" model and "index the entire Internet" scope, I suspect it is really more of a product along the lines of a Google Search Appliance (which tend to run $70K upwards).

            Originally posted by nibb
            The problem with limiting files per domain is that some domains are really rich in content and others are not, so I cannot use a global setting here
            You can set this limit per site, as mentioned in my last post. I suggest taking a break to read up on what I mentioned, as we're starting to go around in circles.

            Originally posted by nibb
            About tracking clicks on search results, I don't agree with you. It's very easy to do with JavaScript; you can even track mouse movement with JavaScript, and I was only suggesting recording which result position was clicked. JavaScript works on the browser side, and I don't think sending a bit of data (to store the result) will affect performance at all; it's like saying that loading an extra GIF button will increase server load. The data sent is almost irrelevant, as it's only one or two digits. Nothing that can't be done with some AJAX.
            You're misunderstanding the requirement. I did not say anything about server load or performance.

            The JavaScript can detect the click, but the click data needs to be stored on the server for the search function to use it. It is this storage that adds a requirement to the setup of the search engine. AJAX can send the data to the server, but the server still needs to store it and retrieve it for use later.
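
            To make the division of labour concrete, here is a minimal sketch (TypeScript on Node; the /click-log endpoint, payload, and log file name are all invented for illustration, and nothing like this ships with Zoom):

            Code:
            // Browser side would be one line in the results template, e.g.:
            //   navigator.sendBeacon("/click-log", JSON.stringify({ pos: 3, url: href }));
            //
            // The server side is where the new requirements appear: something must
            // receive the data, have permission to write it, and read it back later.
            import * as http from "http";
            import * as fs from "fs";

            http.createServer((req, res) => {
              if (req.method === "POST" && req.url === "/click-log") {
                let body = "";
                req.on("data", (chunk) => { body += chunk; });
                req.on("end", () => {
                  // This append is the storage requirement under discussion.
                  fs.appendFile("clicks.log", body + "\n", () => res.end("ok"));
                });
              } else {
                res.statusCode = 404;
                res.end();
              }
            }).listen(8080);
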
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #7
              OK, thanks. I will try V6 then, as I'm on V5, and tweak it a little.
              I meant the Google PageRank feature, which calculates priority based on links pointing to a domain, etc. Is the "V6 Improved Search Ranking Algorithm" the one you mentioned? The page and title boosting, etc., are in V5 as well. It only says there is a new improved algorithm. I guess I just have to try it out in a real test, because from the info page I could not extract much information on how it works differently from V5.



              • #8
                The V6 ranking algorithm takes many more factors into consideration, as illustrated on the page I linked you to. "Links and ALT text" is one of the factors that influence how it ranks a page (along with other things like "Page depth", etc.). The size of the circles is supposed to represent how much of an influence each factor has. These factors ("Links", "Page depth", "Word position", "Word proximity") were not employed in the V5 search algorithm.
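
                Purely as an illustration of the "circle sizes" idea, here is the general shape of a multi-factor score (TypeScript; the factor names and weights are invented for the example, and the real V6 formula is internal to Zoom):

                Code:
                // Each factor is normalized to 0..1; its weight plays the role of
                // the circle size in the diagram. All numbers are invented.
                const weights = {
                  contentMatch: 0.5,
                  links: 0.2,
                  pageDepth: 0.1,
                  wordPosition: 0.1,
                  wordProximity: 0.1,
                };

                type Factors = { [K in keyof typeof weights]: number };

                function rankScore(factors: Factors): number {
                  let score = 0;
                  for (const key of Object.keys(weights) as Array<keyof typeof weights>) {
                    score += weights[key] * factors[key];
                  }
                  return score;
                }
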
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine

