copyright ©2016, tim@menzies.us
Make the case that kNN suffers from the following problems:

- It is computationally expensive (in terms of run time and memory).
- It is intolerant of attribute noise.
- It is intolerant of irrelevant attributes.
- It is sensitive to the choice of the algorithm's similarity function.
- It provides little usable information regarding the structure of the data.
- There is no natural way to work with nominal-valued attributes or missing attributes.
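To make the run-time complaint concrete, here is a minimal brute-force kNN sketch (the names `dist` and `knn` are illustrative, not course code). Every prediction must compare the query against every stored row over every attribute, so each query costs O(rows * attributes) and the entire training set must be kept in memory:

```python
import math
from collections import Counter

def dist(r1, r2):
    # Euclidean distance over all attributes; every query touches every attribute
    return math.sqrt(sum((a - b)**2 for a, b in zip(r1, r2)))

def knn(train, labels, query, k=3):
    # Brute force: one distance computation per training row, then a
    # majority vote over the k nearest labels.
    nearest = sorted(range(len(train)), key=lambda j: dist(train[j], query))[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]

train  = [[1, 1], [1, 2], [8, 8], [9, 8]]
labels = ["a", "a", "b", "b"]
print(knn(train, labels, [1.5, 1.5]))  # → a (nearest the "a" cluster)
```

Note also that `dist` above assumes all-numeric rows, which is exactly the nominal-valued and missing-attribute problem listed last.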
Aha's 1991 distance measure:

    def distance(i, r1, r2, cols=None, f=None):
        cols = cols or Table.DIST(i)
        f = f or THE.edist
        d, n = 0, 10**-32              # n starts near-zero to avoid divide-by-zero
        for col in cols:
            x, y = r1[col.pos], r2[col.pos]
            if x is Thing.UNKNOWN and y is Thing.UNKNOWN:
                continue               # both values missing: skip this column
            if x is Thing.UNKNOWN: x = col.my.furthest(y)  # assume the worst
            if y is Thing.UNKNOWN: y = col.my.furthest(x)
            n += 1
            d += col.my.dist(x, y)**f  # per-column distance, raised to f
        return d**(1/f) / n**(1/f)

- How does it handle missing values and non-numerics?
- What is normalization and what is its role in Aha's algorithm?
- What is the Euclidean measure and what is its role in Aha's algorithm?
- I prefer dividing the results of Aha's distance measure by sqrt(#dimensions). Why?
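One way to see both ideas at once is the following sketch (the names `norm`, `edist`, `los`/`his`, and the tiny `1e-32` guard are illustrative, not the course code). Normalization rescales each column to 0..1 so wide-range columns cannot dominate the sum; dividing by sqrt(#dimensions) keeps the final distance in 0..1 no matter how many columns the table has:

```python
import math

def norm(x, lo, hi):
    # min-max normalization: rescale a raw value into 0..1 so columns
    # with large ranges do not dominate the distance
    return (x - lo) / (hi - lo + 1e-32)

def edist(r1, r2, los, his):
    # Euclidean distance over normalized values, divided by
    # sqrt(#dimensions) so the result is always in 0..1
    d = sum((norm(x, lo, hi) - norm(y, lo, hi))**2
            for x, y, lo, hi in zip(r1, r2, los, his))
    return math.sqrt(d) / math.sqrt(len(r1))

# two maximally separated rows score 1, not sqrt(#dimensions)
print(edist([0, 0, 0], [10, 10, 10], [0, 0, 0], [10, 10, 10]))  # → 1.0
```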
Write down a table of data with 5 rows, three columns, and one class, where column 2 is tightly correlated to column 1. With respect to this data, answer:
- Why is Naive Bayes called Naive?
- Why is Naive Bayes fast?
Write down Bayes' theorem, carefully defining all terms. With respect to the following data, write down examples of all terms in the theorem.
outlook temperature humidity windy play
------- ----------- -------- ----- ----
rainy cool normal TRUE no
rainy mild high TRUE no
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny cool normal FALSE yes
sunny mild normal TRUE yes
Using a Naive Bayes classifier, classify the following test example. Show all working:
Outlook Temp. Windy
Sunny Cool True
How does Naive Bayes handle missing data in the test example?
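A worked sketch of that classification over the table above (plain tuples and an unsmoothed `like` function are illustrative; the course code below adds k/m smoothing). Note how the missing humidity value simply contributes no factor to the product:

```python
rows = [
    ("rainy",    "cool", "normal", "TRUE",  "no"),
    ("rainy",    "mild", "high",   "TRUE",  "no"),
    ("sunny",    "hot",  "high",   "FALSE", "no"),
    ("sunny",    "hot",  "high",   "TRUE",  "no"),
    ("sunny",    "mild", "high",   "FALSE", "no"),
    ("overcast", "cool", "normal", "TRUE",  "yes"),
    ("overcast", "hot",  "high",   "FALSE", "yes"),
    ("overcast", "hot",  "normal", "FALSE", "yes"),
    ("overcast", "mild", "high",   "TRUE",  "yes"),
    ("rainy",    "cool", "normal", "FALSE", "yes"),
    ("rainy",    "mild", "high",   "FALSE", "yes"),
    ("rainy",    "mild", "normal", "FALSE", "yes"),
    ("sunny",    "cool", "normal", "FALSE", "yes"),
    ("sunny",    "mild", "normal", "TRUE",  "yes"),
]

def like(klass, evidence, rows):
    # like(H|E) is proportional to P(H) * product of P(Ei|H)
    kl = [r for r in rows if r[-1] == klass]
    p = len(kl) / len(rows)                          # prior P(H)
    for pos, value in evidence:
        p *= sum(1 for r in kl if r[pos] == value) / len(kl)
    return p

test = [(0, "sunny"), (1, "cool"), (3, "TRUE")]      # outlook, temp, windy
print("yes:", like("yes", test, rows))   # 9/14 * 2/9 * 3/9 * 3/9 ≈ 0.0159
print("no:",  like("no",  test, rows))   # 5/14 * 3/5 * 1/5 * 3/5 ≈ 0.0257
# "no" scores higher, so the example is classified "no"
```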
The following Python code manages the "like" (likelihood) computation.

- For what two different kinds of data would this function fail?
- What is the purpose of the k and m variables?
    def like(row, all, klasses):
        "top-level Python function"
        guess, best, nh, k = None, -1*10**32, len(klasses), THE.nbk
        for this, tbl in klasses.items():
            guess = guess or this
            like = prior = (len(tbl._rows) + k) / (all + k * nh)
            for col in tbl.decs:
                if col.my:
                    x = row[col.pos]
                    if x != Thing.UNKNOWN:
                        like *= col.my.like(x, prior)  # see code, below
            if like > best:
                guess, best = this, like
        return guess
"Num class..."
def like(i,x,*_):
var = i.sd()**2
denom = (2*math.pi*var)**.5
num = math.exp(-(x-i.mu)**2/(2*var))
return num/denom
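For instance, a standalone version of that density (the name `gaussian_like` is illustrative) peaks at x = mu with value 1/(sd*sqrt(2*pi)) and falls off symmetrically on either side:

```python
import math

def gaussian_like(x, mu, sd):
    # probability density of x under a normal distribution N(mu, sd^2)
    var = sd**2
    return math.exp(-(x - mu)**2 / (2*var)) / math.sqrt(2*math.pi*var)

print(round(gaussian_like(20, 20, 5), 4))  # → 0.0798, the peak 1/(5*sqrt(2*pi))
```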
"Sym class..."
def like(i,x,prior):
m = 2
return (i.counts.get(x,0) + m*prior)/(i.n + m)- In the above, using diagrams showing specific examples can you explain what is going:
- in
Num.like - in
Sym.like
- in
Num and Sym class
- These classes have four groups of services. For each, describe when they might be useful:
  - add, sub
  - variability
  - like
  - dist
- How is variability different for Num and Sym?
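One common way the add/sub services support incremental variability is a Welford-style update (this is a sketch under that assumption, not the course's Num class): add folds a new value into a running mean and sum of squared deviations, and sub reverses the same arithmetic so a sliding window can forget old values without rescanning everything:

```python
class Num:
    # incremental mean and variance (Welford-style)
    def __init__(i):
        i.n, i.mu, i.m2 = 0, 0, 0
    def add(i, x):
        i.n += 1
        d = x - i.mu
        i.mu += d / i.n
        i.m2 += d * (x - i.mu)          # accumulates sum of squared deviations
    def sub(i, x):
        i.n -= 1
        d = x - i.mu
        i.mu -= d / i.n                 # exact reverse of the add update
        i.m2 -= d * (x - i.mu)
    def sd(i):
        return (i.m2 / (i.n - 1))**0.5 if i.n > 1 else 0

num = Num()
for x in [4, 10, 15, 38, 54, 57]:
    num.add(x)
print(round(num.mu, 2), round(num.sd(), 2))
```

For a Sym column the analogue is simpler: add/sub increment or decrement a symbol count, and variability is entropy over those counts rather than standard deviation.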
Table class
- This class creates one Num or Sym object per header, then stores that object in one or more "convenience" lists. For each of the following, describe what it does and when you might process it differently to other columns:
  - cols
  - objs (hint: not for standard data mining, but for optimization)
  - klass (hint: not for optimization)
  - deps (hint: dependent variables)
  - decs (hint: independent variables)
Table's distance function is shown below.

- When computing distance between rows in k-means, the klass is _not_ used. Why?
- For Num and Sym, write down in English how dist might work (3rd-last line). Remember normalization!
- Under what circumstances would n be less than the number of independent attributes?
- What is the point of dividing the sum of the distances by n**(1/f)?
    def distance(i, r1, r2, cols=None, f=None):
        cols = cols or Table.DIST(i)
        f = f or THE.edist
        d, n = 0, 10**-32
        for col in cols:
            x, y = r1[col.pos], r2[col.pos]
            if x is Thing.UNKNOWN and y is Thing.UNKNOWN:
                continue
            if x is Thing.UNKNOWN: x = col.my.furthest(y)
            if y is Thing.UNKNOWN: y = col.my.furthest(x)
            n += 1
            d += col.my.dist(x, y)**f
        return d**(1/f) / n**(1/f)

- How might the following algorithms use Table?
  - Naive Bayes
  - Mini-batch K-means
  - Reading data from disk
  - Incrementally processing data in eras of 1000 instances per era
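As a starting point for the mini-batch question, here is a minimal mini-batch k-means sketch (function and parameter names are illustrative, not course code). Each step samples a small batch and nudges the nearest centroid toward each sampled row with a shrinking per-centroid learning rate, so the full table never has to be visited in one pass; the same pattern suits data arriving in eras:

```python
import random

def mini_batch_kmeans(rows, k=2, batch=4, epochs=20, seed=1):
    random.seed(seed)
    centroids = [list(r) for r in random.sample(rows, k)]  # seed from data
    counts = [0] * k
    for _ in range(epochs):
        for row in random.sample(rows, batch):             # one small batch
            # index of the nearest centroid (squared Euclidean distance)
            c = min(range(k),
                    key=lambda j: sum((a - b)**2
                                      for a, b in zip(centroids[j], row)))
            counts[c] += 1
            lr = 1 / counts[c]                             # shrinking step size
            centroids[c] = [a + lr * (b - a)
                            for a, b in zip(centroids[c], row)]
    return centroids

rows = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
print(mini_batch_kmeans(rows))
```

In Table terms, only the independent decs columns would feed the distance computation, and the Num objects supply the normalization that the raw sketch above omits.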