question about calculate_distance() #6

@kasiapekala

I have a question about calculate_distance() and the after_cuts object.

How will it behave if one calls calculate_distance() on factor vectors with different numbers of levels?

For example, suppose the train and test samples are prepared outside of R. If a factor variable in the test sample (variable_new) is missing some of its values, it will not have the same number of levels as the corresponding variable in the train sample (variable_old). Then, if I'm not mistaken, the c() call that creates after_cuts will encode the two vectors inconsistently.

Example:

variable_old <- apartments[, 6]
variable_new <- filter(apartments_test, district != "Praga")[, 6]
variable_new2 <- droplevels(variable_new)

length(levels(variable_new))
[1] 10
length(levels(variable_new2))
[1] 9

calculate_distance(variable_old,variable_new)
[1] 0.092
calculate_distance(variable_old,variable_new2)
[1] 0.097

If that's indeed a problem, then maybe the change proposed below would solve it?

after_cuts <- as.factor(c(as.character(variable_old),as.character(variable_new)))
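To illustrate why going through as.character() might help, here is a minimal base-R sketch. The vectors old and new are hypothetical toy stand-ins for variable_old and variable_new2, not the apartments data, and the sketch does not call calculate_distance() itself. It only shows that the same label can carry different integer codes in two factors, and that combining the labels as character gives one shared set of levels. Note that what c() does with factors depends on the R version: before R 4.1.0 it returns the underlying integer codes, while from 4.1.0 on it returns a factor with the union of the levels.

```r
# Hypothetical toy data standing in for variable_old / variable_new2
old <- factor(c("Bemowo", "Praga", "Wola"))
new <- droplevels(factor(c("Bemowo", "Wola"), levels = levels(old)))

# The same label maps to a different integer code in each factor
as.integer(old)[old == "Wola"]  # 3
as.integer(new)[new == "Wola"]  # 2

# Combining via character keeps the labels, so both samples
# end up encoded against one shared set of levels
combined <- as.factor(c(as.character(old), as.character(new)))
levels(combined)
table(combined)
```

With the shared levels, "Wola" from either sample falls into the same category, which is what the proposed change to after_cuts relies on.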
