
Normalization parameters should NOT be coupled with model definition #326

Open
DIXLTICMU opened this issue Aug 26, 2020 · 11 comments

@DIXLTICMU

DIXLTICMU commented Aug 26, 2020

I would like to call out that setting global normalization parameters along with the LTR model definition (#292) is, IMO, WRONG, especially for dynamic (query-dependent) features.

The mechanism introduced in that PR assumes the values of each feature type follow the exact same distribution, but this is incorrect. They are expected to follow the same distribution family, but not the exact same distribution.

For example, suppose we have WSJ as our corpus, "TF/IDF" as the feature type, and consider 2 different queries:

  1. "country"
  2. "Zimbabwe"

Obviously, these 2 queries are expected to have very different TFs, as "country" simply appears more often than "Zimbabwe" and therefore tends to have higher TFs. In terms of IDF, the document frequency of "Zimbabwe" is going to be far lower than that of "country". As a result, the TF/IDF values are going to be significantly different between the 2 groups of documents matching each query: the feature values for the documents returned for query NO. 1 may look like [0.007, 0.005, 0.003, ...], while those for query NO. 2 may look like [5, 4, 3, ...].

As you can see, the magnitudes are vastly different for the same feature type, depending on the query and the corpus. Why? Because the values come from 2 different processes, much like drawing numbers from 2 different dice, one numbered 1-6, the other numbered 1000-6000. Both yield roughly normally distributed numbers (technically a multinomial distribution, but it can be approximated by a Gaussian), yet they are technically 2 different distributions... This is what I meant by "they belong to the same distribution family, but not exactly the same distribution".

  • If we follow the current normalization strategy, where a single set of global normalization parameters is configured, how are we going to properly normalize the feature values coming from these 2 queries?

Therefore we should not use a global set of normalization parameters, at least not for dynamic features. We should still define in the model what kind of normalization is applied to each feature type, but not the actual normalization parameters, as those should be determined at runtime.

I understand this isn't much of an issue for static (meta) features such as "number of stars for reviews", but many LTR features are query-dependent, and therefore they are not going to be handled properly by the current version.

I understand the technical challenge of doing normalization in a "micro" way, where the normalization parameters differ for each query, but I believe it is still doable.

  • For min-max normalization, we can simply do a pre-fetch of the feature values to find the min/max, and then send those as dynamic normalization parameters to generate normalized feature values.
  • For z-score normalization, we can do sampling to find the mean and stddev (see the sketch after this list).
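
As an illustration only (this is not plugin code), here is a minimal Python sketch of what such per-query normalization could look like, assuming the per-query feature values have already been collected:

    import statistics

    def min_max_normalize(values):
        # Scale one query's feature values into [0, 1] using that query's own min/max.
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    def z_score_normalize(values):
        # Center one query's feature values using that query's own mean/stddev.
        mean = statistics.mean(values)
        stddev = statistics.stdev(values) or 1.0  # guard against zero spread
        return [(v - mean) / stddev for v in values]

    # TF/IDF-like values for two different queries, normalized independently:
    print(min_max_normalize([0.007, 0.005, 0.003]))  # roughly [1.0, 0.5, 0.0]
    print(z_score_normalize([5, 4, 3]))              # [1.0, 0.0, -1.0]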

The risk of the current normalization strategy is that it is subtly misleading, and it will often not produce the outcome one expects.

Normalization is a very important topic in academia and a practical issue in industry; if we don't do this right, we may lose lots of potential users working in IR.

@DIXLTICMU
Author

@softwaredoug

@softwaredoug
Collaborator

Thanks @DIXLTICMU ! You bring up some good points and great feedback.

I would disagree that it's misleading; it's pretty explicit that a given model will use a specified min/max or mean/std deviation before evaluating the model. This is also what Solr LTR does.

Also, I can think of query-dependent counterexamples to the "country" vs "Zimbabwe" example you cite, where instead of following the query-level distribution you would expect the global one. So I think it's going to be a rather feature-specific decision.

Consider a commute distance between the user and a job. It may be that commute distance preferences are NOT dependent on where I live. A 30 minute commute is perhaps globally the mean in desirability. A user searching for jobs in city A, where the only options are > 1 hour commutes, would want that factor treated as less relevant, even though the commute distance feature is query-dependent. A user searching in city B, with some ~10 minute commute options, would value that feature very highly. So the "impact" of that feature, though query-dependent, depends on the global distribution, not the within-query distribution.

However, query-dependent or dynamic feature normalization could be a valuable feature for seeing what matched above / below average for your query. Something like:

    "scaled_min_max": {
        "query": {
            <inner query>
        }
    }

There may already be a way to do this with an existing Elasticsearch query, but I haven’t found it.

@softwaredoug
Collaborator

Just as a reference, Solr has a scale function that does min/max scaling: https://lucene.apache.org/solr/guide/6_6/function-queries.html

@DIXLTICMU
Author

DIXLTICMU commented Aug 27, 2020

@softwaredoug , thanks for the quick response!

Yeah, I think your example extends the scope of "query-dependent feature" a bit... And given the Solr LTR practice, I will rephrase my argument as: "Resorting to a constant set of global normalization parameters does not provide appropriate normalization support for certain query-dependent features such as TF/IDF".

The commute distance example you raised is an interesting one... it is query dependent, yet in this particular case a global normalization strategy based on the ~30 minute commute mean might be practically useful. Although I would still claim, from a statistical perspective, that commute times in different cities yield different distributions.

That being said, features such as TF/IDF remain inappropriately treated with just one set of global normalization parameters. AFAIK, most IR features that address textual relevance will suffer from this deficiency in normalization support. See Microsoft's LTR feature bank (https://arxiv.org/ftp/arxiv/papers/1306/1306.2597.pdf) from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/

I am not sure about the scaled_min_max syntax though... I couldn't find it in the wiki... There is some scaling support in the function score query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html), but alas it is not really dynamic either... the scaling factors are assumed to be constants and prerequisites for issuing such queries. So using them is no different from assuming a fixed global set of normalization parameters...

I would be surprised if ES actually had existing support for this, because it requires aggregating over the scores collected from all the shards, which is not very typical.

Again, I don't think it is that difficult to implement... ES supports metrics aggregations:
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-aggregations-metrics-stats-aggregation.html

And if that includes support for script fields (https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-fields.html#script-fields), it would allow us to fetch stats such as min, max, mean, stddev, and other information useful for normalization in a unified and generic way, regardless of the complexity of a predefined feature.
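
For illustration, here is a minimal sketch of what fetching such per-query stats could look like with the Python Elasticsearch client, assuming a hypothetical index name and a placeholder script standing in for the real feature computation:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Hypothetical index and query; the script below is a stand-in for whatever
    # computes the raw feature value per matching document.
    resp = es.search(
        index="my_index",
        body={
            "size": 0,
            "query": {"match": {"body": "zimbabwe"}},
            "aggs": {
                "feature_stats": {
                    "extended_stats": {
                        "script": {"source": "doc['some_numeric_field'].value"}
                    }
                }
            },
        },
    )

    stats = resp["aggregations"]["feature_stats"]
    # stats["min"], stats["max"], stats["avg"], stats["std_deviation"] could then
    # serve as per-query normalization parameters.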

The good thing is that such stats can be derived from data collected across multiple shards, so we can still keep it computationally efficient and utilize ES as a distributed system. We can either:

  • deduce the stats useful for normalization on each shard, normalize all the feature values within each shard locally, and merge them together in the final phase where the feature values from all the shards are merged into one set, or
  • deduce partial stats useful for normalization on each shard, combine those partial stats from all the shards into a set of global normalization stats, and then normalize all the feature values after they are merged together.

The second option is much slower but guarantees accuracy; the first option is much faster yet may be slightly less accurate depending on data size. (A sketch of the second option follows.)
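
To make the second option concrete, here is a small sketch in plain Python (with made-up per-shard numbers) of how partial statistics could be merged into a single set of normalization parameters:

    import math

    # Hypothetical partial stats returned by each shard: (count, sum, sum of squares).
    shard_stats = [
        (3, 0.06, 0.0014),  # shard 0
        (3, 6.0, 14.0),     # shard 1
    ]

    count = sum(c for c, _, _ in shard_stats)
    total = sum(s for _, s, _ in shard_stats)
    total_sq = sum(sq for _, _, sq in shard_stats)

    mean = total / count
    # Population variance from the merged sums; use (count - 1) for a sample variance.
    variance = total_sq / count - mean ** 2
    stddev = math.sqrt(max(variance, 0.0))

    # mean/stddev can now z-score normalize the merged feature values; the
    # equivalent min/max merge is simply min-of-mins and max-of-maxes.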

@nathancday
Member

Welcome to the project @DIXLTICMU. We are always happy when new people join the community, and we strive to keep it welcoming and respectful. Lots of people, including @softwaredoug, have spent a lot of time and energy building this plugin; please consider that when you offer constructive criticism.

Are you suggesting to re-calculate normalization parameters each time a query is called based on the index statistics for that field at that time?

@DIXLTICMU
Author

@nathancday, I am glad to help! Even though my Java might have gone rusty and probably degraded into ellipses... : )

Are you suggesting to re-calculate normalization parameters each time a query is called based on the index statistics for that field at that time?

Yes, I would like to introduce an optional feature normalization setting called dynamic, which would, for each query, collect statistics for the target feature value population across all shards and deduce the normalization parameters from them.
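
Purely as a hypothetical illustration (none of this syntax exists in the plugin today), a per-feature dynamic normalizer might be expressed roughly like this, written here as a Python dict mirroring the JSON of a feature definition:

    # Hypothetical shape of a feature definition carrying a "dynamic" normalizer;
    # the actual field names and structure would be decided in the PR.
    feature = {
        "name": "title_bm25",
        "params": ["keywords"],
        "template": {"match": {"title": "{{keywords}}"}},
        "normalizer": {
            "dynamic": {"method": "z_score"}  # mean/stddev computed per query at search time
        },
    }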

I will try and submit a PR once I get the chance...

@nathancday
Member

Interesting idea. It has been my understanding that normalization parameters are estimated when the model is trained, and those parameters are then used directly on test/eval/validation data and not re-calculated. Because normalization changes the scale of a given feature, and the model does its optimization on that new scale, it worries me that model performance would degrade if the scale is changed without the model being retrained. If the distribution for a given feature is substantially different in the production index from the training data, then I think that indicates it's time to generate new training data and a new model that's a better approximation of the production data.

But this is applied machine learning, and not pure statistics, so if performance is improved then hooray! I think you are correct to keep execution time as a central consideration of a "dynamic" option. It does feel like it would add significant execution overhead, and we already use re-rank queries just to handle the required feature queries in a reasonable amount of time (for production search responses).

@DIXLTICMU
Author

DIXLTICMU commented Sep 1, 2020

@nathancday, I believe there is some misunderstanding here about "query-time normalization violating model training assumptions". This is simply not true... I am not attempting to introduce inconsistencies between offline model training and real-time model application.

normalization parameters are estimated when the model is trained and those parameters are then used directly on test/eval/validation data and not re-calculated.

To the best of my knowledge as an IR researcher, this is not the case... In a conventional setting (without using ES-LTR), raw feature values are always pre-processed before being sent to model training; this is where normalization happens. For most query-dependent features the numbers scale with the query, which is where query-specific normalization is required, and therefore there exists no single set of normalization parameters for normalizing all the feature values across different queries...

Because normalization changes the scale of a given feature, and the model does its optimization on that new scale.

Again, conceptually this is done before model training, and the model can simply assume that the feature values range from 0 to 1.

You may look into some benchmark LTR training data from Microsoft, study the distribution of the feature values for different queries, and get a clearer picture.

But this is applied machine learning, and not pure statistics

I do want to argue that my approach (query-specific normalization) makes more statistical sense for query-dependent features than assuming a single global set of normalization parameters for all the feature values. The latter assumes the same distribution for a feature across different queries, but the reality is that, despite belonging to the same distribution family, they do not follow the same distribution (distribution parameters such as mean and stddev are different).

It's OK to assume a global set of normalization parameters for features such as 5-star ratings, commute distance, document length, etc., but not for features such as TF-IDF, BM25, Dirichlet prior smoothing scores, etc.

So what happens after we introduce dynamic query-time normalization to ES-LTR?
There is actually one more thing I forgot to mention: allowing the normalization strategy to be defined at the per-feature level, in addition to the model level. This would allow ES to do normalization when logging features, and after that we wouldn't need to worry about normalization ever again.

In the long run I would advocate deprecating feature normalization defined at the model level, because feature normalization should be orthogonal to which model we choose. Once the normalization has been coupled with a defined feature, we can safely assume that everything downstream of it is normalized and no further data processing is needed.

The current way of defining normalization along with the model definition creates a situation where the user must do their own normalization before training the model with RankLib or XGBoost, which IMO is avoidable extra work.

Also, I will be on the lookout for runtime performance : )

@nathancday
Member

The current normalization support is at the feature level.

All normalization requires knowing some descriptive statistics about a feature's distribution/range. These estimations are done on training data, and that training data will likely include query-dependent features for LTR applications. It does not matter whether a feature is query dependent or not; the normalization params for that feature will be calculated from the data observed for that feature in the training data.

I don't understand how you propose to do "query-specific" normalization without using previously calculated values.

All models in the context of this plugin are trained outside of ES and then ported/defined in ES to be utilized in production search. Data pre-processing is a critical component of a model's definition that must be coupled to it. Imagine a model trained on z-score normalized features and then feeding that model new data that has not been pre-processed and is still on the original scale. That is a recipe for poor performance.

@DIXLTICMU
Author

DIXLTICMU commented Sep 7, 2020

@nathancday, I believe there is a gap in perspective between us, and any mindset is capable of imperfection. I understand this is lengthy, but please take a moment to read through my replies.

The current normalization support is at the feature level.

Architecturally it is, and I know that, but what is currently exposed to users is still that normalizations are configured when defining the model: https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/training-models.html#creating-a-model-with-feature-normalization

All normalization requires knowing some descriptive statistics about a feature's distribution/range.

I am not debating the validity of this. I am only arguing that the "descriptive statistics" should be calculated on a per-query basis. And I must reiterate that certain features do NOT behave the way you presume. Please see the example below.

These estimations are done on training data and that training data will likely include query dependent features for LTR applications. It does not matter if a feature is query dependent or not, the normalization params for that feature will be calculated from the data observed for that feature in the training data.

I must contest this... It really does matter.

Again... depending on your query, certain features, especially those measuring textual relevance such as BM25, JM, or Dirichlet prior smoothing scores, will scale very differently in magnitude... applying the same normalization on top of them only makes the smaller feature values nearly indistinguishable to the model during training... think about the following example with z-score normalization:

  • query NO. 1 with feature values on a set of retrieved/training documents as {0.01, 0.02, 0.03}
  • query NO. 2 with feature values on ANOTHER set of retrieved/training documents as {1, 2, 3}
  • query NO. 3 with feature values on YET ANOTHER set of retrieved/training documents as {100, 200, 300}

And for simplicity let's suppose the degree of relevance is positively correlated to those feature values.

If we compute a single set of normalization parameters from the totality of the training data, ignoring the fact that those values vary drastically across queries, we get a mean of 67.34 and a (sample) stddev of about 111.34, and the normalization will simply transform the data into the following:

  • query NO. 1 -> {-0.60467, -0.60458, -0.60449} # notice how nearly indistinguishable they are
  • query NO. 2 -> {-0.59578, -0.5868, -0.57782}
  • query NO. 3 -> {0.29331, 1.19138, 2.08945}
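
For reference, a tiny Python check reproduces these numbers (using the sample standard deviation over all nine values):

    import statistics

    values = [0.01, 0.02, 0.03, 1, 2, 3, 100, 200, 300]
    mean = statistics.mean(values)     # ~67.34
    stddev = statistics.stdev(values)  # ~111.4 (sample stddev)

    print([round((v - mean) / stddev, 5) for v in values])
    # roughly [-0.605, -0.605, -0.604, -0.596, -0.587, -0.578, 0.293, 1.191, 2.089]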

Now, do you believe the above is the best way to preprocess your data for training the model? Is your model smart enough to reason with those values all together? Or will this potentially become "a recipe for poor performance"?

I don't understand how you propose to do "query-specific" normalization without using previously calculated values.

If you follow my idea of doing query-time normalization, the above data will be normalized as follows:

  • query NO. 1 -> { -1, 0, 1} # sample mean is 0.02, stddev is 0.01
  • query NO. 2 -> {-1, 0, 1} # sample mean is 2, stddev is 1
  • query NO. 3 -> {-1, 0, 1} # sample mean is 200, stddev is 100

Now the normalization accurately identifies which documents are weaker or stronger, based on the reasonable assumption that the feature values of the documents for the same query follow one distribution instance with its own set of normalization parameters.

And yes this means we need to calculate mean and stddev on the fly during query time.
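
The same check as above, applied per query instead of globally, shows each group collapsing to the same scale:

    import statistics

    per_query_values = {
        "NO. 1": [0.01, 0.02, 0.03],
        "NO. 2": [1, 2, 3],
        "NO. 3": [100, 200, 300],
    }

    for query, values in per_query_values.items():
        mean = statistics.mean(values)
        stddev = statistics.stdev(values)  # sample stddev: 0.01, 1, 100
        print(query, [round((v - mean) / stddev, 3) for v in values])
        # every query -> [-1.0, 0.0, 1.0]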

Data pre-processing is a critical component of a model's definition that must be coupled to it

I am not debating that, although I would point out that the model itself does not care about how you preprocess the data, as long as the data it sees at application time is generated and processed the same way as it was during training.

Before training the model you can certainly perform an extra level of normalization on top of the feature values that are already normalized per query, but that is unlikely to change the story by much. This is because the feature values across different queries follow the same distribution family (e.g. Gaussian), albeit different distribution instances (Gaussians with different means and stddevs). Once you normalize on a per-query basis, they are reduced to the same distribution instance, so there is no need to normalize again over the entire training data set, because it is unlikely to have a practical impact.

Yes, there are indeed certain features whose values follow the exact same distribution instance regardless of the query, such as "5-star customer reviews", "data recency", "commute distance", etc., and the current normalization works perfectly for those features. My point is that it can definitely be improved and extended. Also, I would point out that they are not the most important LTR features...

Think about all the textual relevance features listed in Microsoft's LTR feature bank (https://arxiv.org/ftp/arxiv/papers/1306/1306.2597.pdf); if we do not treat those features with proper normalization, we are going to grievously discount the value of this product.

@nathancday
Member

I think I finally understand your normalization suggestion; thanks for your descriptions and patience.

Do you think we could store a Painless script to achieve this? We just implemented the termStats query to allow custom functions as features, and I'm wondering if that's similar to what's required here.
