Python-exemplarisch

OneR Algorithm

All programs can be downloaded from here.

Problem description

Data collections often contain records in which the value of a particular target attribute depends on the values of other attributes. The question then is which of these attributes has the strongest influence on the target attribute. As an important example, we examine to what extent the weather at a given time, say today at noon, can be predicted from weather observations made early in the morning.

For simplicity, the target attribute takes only the weather values snowfall (snow), rainfall (rain) and sunny (sun). As observed weather attributes we choose:

Clouds (clouds) with the values overcast, cloudy, clear
Temperature (temperature) with the values cold, mild, hot
Humidity (humidity) with the values low, normal, high
Wind (wind) with the values weak, strong

The training set consists of only 20 records in the file weather.csv, which can be downloaded from here. The table below shows its contents; the programs read the file with the fields separated by commas and the column names in the first line.

Clouds	Temperature	Humidity	Wind	Weather
clear	cold	high	strong	snow
clear	hot	high	weak	rain
cloudy	cold	high	strong	snow
clear	mild	high	strong	sun
clear	mild	normal	strong	sun
cloudy	cold	normal	weak	snow
overcast	cold	low	weak	sun
cloudy	mild	high	strong	rain
cloudy	cold	low	strong	snow
clear	mild	low	strong	sun
clear	hot	low	weak	snow
clear	mild	low	weak	sun
overcast	cold	normal	strong	snow
cloudy	cold	high	weak	rain
overcast	hot	normal	strong	sun
cloudy	hot	normal	weak	rain
clear	hot	normal	weak	rain
cloudy	mild	normal	weak	sun
overcast	mild	high	weak	rain
overcast	hot	low	strong	sun
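
For readers who want to experiment without the download, the following is a minimal sketch of how such records could be read with Python's csv module. It is written for Python 3 and assumes a comma-separated layout with a header line, as the split(",") call in the programs below suggests. The names sample_text and read_records are chosen here for illustration, and only the first three records of the table are included.

```python
import csv
import io

# Header plus the first three records of the table above,
# written comma-separated (assumed file layout).
sample_text = """Clouds,Temperature,Humidity,Wind,Weather
clear,cold,high,strong,snow
clear,hot,high,weak,rain
cloudy,cold,high,strong,snow
"""

def read_records(stream):
    """Return one dictionary per record, keyed by the column names."""
    return list(csv.DictReader(stream))

records = read_records(io.StringIO(sample_text))
for record in records:
    print(record["Humidity"], "->", record["Weather"])
```

With a real file, the same function could be given an open file object instead of the io.StringIO wrapper.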

The OneR algorithm examines each attribute (clouds, temperature, humidity, wind) individually to see how well it alone predicts the weather. (Hence the name One Rule, abbreviated OneR.) For each attribute an error rate is determined, and at the end the attribute with the smallest error rate is chosen for the weather prediction.

Prediction with a single attribute

We pick one attribute, for example humidity, and determine for each of its values (low, normal, high) the number of instances of each weather condition. In other words, we examine how often low, then normal, then high humidity led to sun, rain or snow in the training set.

It is convenient to store the attributes with their values in a dictionary attributes and the frequencies in a dictionary frequencies.

Program:

# OneRule1.py

dataFile = "weather.csv"

attributes = {"Clouds":["overcast", "cloudy", "clear"], 
              "Temperature":["cold", "mild", "hot"], 
              "Humidity":["low", "normal", "high"], 
              "Wind":["weak", "strong"]}
attributeNames = ["Clouds", "Temperature", "Humidity", "Wind"]
# targetValues: "sun", "rain", "snow"

def loadData(fileName):
    # returns a list of rows (lists of strings), or [] if the file cannot be opened
    try:
        fData = open(fileName, 'r')
    except IOError:
        return []
    out = []
    for line in fData:
        line = line.strip()  # remove line break and surrounding blanks
        if len(line) == 0:  # empty line
            continue
        out.append(line.split(","))
    fData.close()
    return out

X = loadData(dataFile)

attribute = "Humidity"
print("Considering attribute: '%s' now..." % attribute)
for value in attributes[attribute]:
    print("  Statistics for attribute '%s' with value '%s':"
          % (attribute, value))
    frequencies = {"sun":0, "rain":0, "snow":0}
    k = X[0].index(attribute) # position of attribute
    for sample in X[1:]:
        if sample[k] == value:
            if sample[4] == "sun":
                frequencies["sun"] += 1
            if sample[4] == "rain":
                frequencies["rain"] += 1
            if sample[4] == "snow":
                frequencies["snow"] += 1
    # adjacent string literals are joined, so no stray blanks end up in the output
    print("    Frequency (number of instances): "
          "'sun': %s, 'rain':, %s, 'snow': %s"
          % (frequencies["sun"], frequencies["rain"], frequencies["snow"]))

Result:
The following statistics are printed:
Considering attribute: 'Humidity' now...
Statistics for attribute 'Humidity' with value 'low':
Frequency (number of instances): 'sun': 4, 'rain':, 0, 'snow': 2
Statistics for attribute 'Humidity' with value 'normal':
Frequency (number of instances): 'sun': 3, 'rain':, 2, 'snow': 2
Statistics for attribute 'Humidity' with value 'high':
Frequency (number of instances): 'sun': 1, 'rain':, 4, 'snow': 2
From this we conclude that the best predictions are:

'humidity low'->'sun', 'humidity normal'->'sun', 'humidity high'->'rain'
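
Choosing the most frequent weather value per humidity value can be sketched in a few lines. The frequencies are copied from the statistics printed above. The helper best_prediction is a name introduced here and is not part of the original programs; the sketch is written for Python 3.

```python
# Frequencies taken from the statistics printed by OneRule1.py
humidity_frequencies = {
    "low":    {"sun": 4, "rain": 0, "snow": 2},
    "normal": {"sun": 3, "rain": 2, "snow": 2},
    "high":   {"sun": 1, "rain": 4, "snow": 2},
}

def best_prediction(frequencies):
    """Return the weather value that occurred most often."""
    return max(frequencies, key=frequencies.get)

rule = {value: best_prediction(freq)
        for value, freq in humidity_frequencies.items()}
print(rule)  # {'low': 'sun', 'normal': 'sun', 'high': 'rain'}
```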

Error rate

If we apply the prediction model found above for the attribute humidity, with the rule

humidity low->sun, humidity normal->sun, humidity high->rain

an error rate can be stated as the number of wrong predictions. For each value of the attribute, we read the following counts from the result above (all instances whose weather differs from the predicted one):

humidity low->2, humidity normal->4, humidity high->3.
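
The arithmetic behind these numbers is simply "all instances minus the hits of the predicted weather". A small sketch, again using the frequencies printed above, makes this explicit. The function name error_count and the variable names are illustrative choices, and the final division by 20 assumes the 20 training records stated earlier.

```python
# Frequencies taken from the statistics printed by OneRule1.py
humidity_frequencies = {
    "low":    {"sun": 4, "rain": 0, "snow": 2},
    "normal": {"sun": 3, "rain": 2, "snow": 2},
    "high":   {"sun": 1, "rain": 4, "snow": 2},
}

def error_count(frequencies):
    """Instances that do NOT have the most frequent weather value."""
    return sum(frequencies.values()) - max(frequencies.values())

errors = {value: error_count(freq)
          for value, freq in humidity_frequencies.items()}
total_error = sum(errors.values())

print(errors)            # {'low': 2, 'normal': 4, 'high': 3}
print(total_error)       # 9
print(total_error / 20)  # 0.45, i.e. 9 of the 20 training records are mispredicted
```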

To compute these numbers we extend the previous program. First we sort the keys of the dictionary frequencies so that the weather values are in order of increasing number of hits. The value used for the prediction model is then in the third position (index 2). We store it in the dictionary predictionRule. The error counts are collected in the dictionary errorCount.

Finally we take stock and print the sum of the errors as a measure of the quality of the prediction model based on the chosen attribute (humidity).

Program:
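
The sorting step described above relies on the fact that sorted() applied to a dictionary returns its keys, and that the key function frequencies.get orders them by their counts. The following Python 3 sketch shows the idiom in isolation. The second dictionary uses the counts for cloudy skies from the training table and illustrates that a tie is decided only by the order of the keys, since Python's sort is stable.

```python
frequencies = {"sun": 4, "rain": 0, "snow": 2}  # humidity 'low'

# Keys sorted by ascending count; the last one is the prediction
ratings = sorted(frequencies, key=frequencies.get)
print(ratings)  # ['rain', 'snow', 'sun']

prediction = ratings[-1]
errors = sum(frequencies[target] for target in ratings[:-1])
print(prediction, errors)  # sun 2

# Tie between 'rain' and 'snow' (clouds 'cloudy'): the stable sort keeps
# the insertion order of equal counts, so 'snow' ends up last here.
tied = {"sun": 1, "rain": 3, "snow": 3}
print(sorted(tied, key=tied.get))  # ['sun', 'rain', 'snow']
```

Whichever of the tied values is chosen, the error count is the same, so the choice of the best attribute is not affected by the tie.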

# OneRule2.py


dataFile = "weather.csv"

attributes = {"Clouds":["overcast", "cloudy", "clear"], 
              "Temperature":["cold", "mild", "hot"], 
              "Humidity":["low", "normal", "high"], 
              "Wind":["weak", "strong"]}
attributeNames = ["Clouds", "Temperature", "Humidity", "Wind"]
# targetValues: "sun", "rain", "snow"

def loadData(fileName):
    # returns a list of rows (lists of strings), or [] if the file cannot be opened
    try:
        fData = open(fileName, 'r')
    except IOError:
        return []
    out = []
    for line in fData:
        line = line.strip()  # remove line break and surrounding blanks
        if len(line) == 0:  # empty line
            continue
        out.append(line.split(","))
    fData.close()
    return out

X = loadData(dataFile)

attribute = "Humidity"
predictionRule = {}
errorCount = {}
print("Considering attribute: '%s' now..." % attribute)
for value in attributes[attribute]:
    print("  Statistics for attribute '%s' with value '%s':"
          % (attribute, value))
    frequencies = {"sun":0, "rain":0, "snow":0}
    k = X[0].index(attribute) # position of attribute
    for sample in X[1:]:
        if sample[k] == value:
            if sample[4] == "sun":
                frequencies["sun"] += 1
            if sample[4] == "rain":
                frequencies["rain"] += 1
            if sample[4] == "snow":
                frequencies["snow"] += 1
    print("    Frequency (number of instances): "
          "'sun': %s, 'rain':, %s, 'snow': %s"
          % (frequencies["sun"], frequencies["rain"], frequencies["snow"]))
    # sort the target values by ascending frequency
    ratings = sorted(frequencies, key = frequencies.get)
    predictionRule[value] = ratings[2]  # most frequent target value
    # all instances with one of the two other target values are errors
    errorCount[value] = frequencies[ratings[0]] + frequencies[ratings[1]]

print("Prediction Rule: " + str(predictionRule))
print("Error Count: " + str(errorCount))
err = sum(errorCount.values())
print("Total error for prediction with attribute '%s': %d"
      % (attribute, err))

Result:
As expected, we get the following output:
Considering attribute: 'Humidity' now...
Statistics for attribute 'Humidity' with value 'low':
Frequency (number of instances): 'sun': 4, 'rain':, 0, 'snow': 2
Statistics for attribute 'Humidity' with value 'normal':
Frequency (number of instances): 'sun': 3, 'rain':, 2, 'snow': 2
Statistics for attribute 'Humidity' with value 'high':
Frequency (number of instances): 'sun': 1, 'rain':, 4, 'snow': 2
Prediction Rule: {'normal': 'sun', 'high': 'rain', 'low': 'sun'}
Error Count: {'normal': 4, 'high': 3, 'low': 2}
Total error for prediction with attribute 'Humidity': 9

OneR analysis: the search for the best attribute

So far we have considered only a single attribute. The OneR algorithm is defined as follows:

Find the attribute with the smallest error and use only the rule of this one attribute for predictions.

To iterate over all attributes conveniently, we move the code that determines the rule and the error into a function createRule(attribute). At the end we search for the attribute with the smallest error.

Program:

# OneRule3.py

from operator import itemgetter

dataFile = "weather.csv"

attributes = {"Clouds":["overcast", "cloudy", "clear"], 
              "Temperature":["cold", "mild", "hot"], 
              "Humidity":["low", "normal", "high"], 
              "Wind":["weak", "strong"]}
attributeNames = ["Clouds", "Temperature", "Humidity", "Wind"]
# targetValues: "sun", "rain", "snow"

def loadData(fileName):
    # returns a list of rows (lists of strings), or [] if the file cannot be opened
    try:
        fData = open(fileName, 'r')
    except IOError:
        return []
    out = []
    for line in fData:
        line = line.strip()  # remove line break and surrounding blanks
        if len(line) == 0:  # empty line
            continue
        out.append(line.split(","))
    fData.close()
    return out

def createRule(attribute):
    predictionRule = {}
    errorCount = {}
    print("Considering attribute: '%s' now..." % attribute)
    for value in attributes[attribute]:
        frequencies = {"sun":0, "rain":0, "snow":0}
        k = X[0].index(attribute) # position of attribute
        for sample in X[1:]:
            if sample[k] == value:
                if sample[4] == "sun":
                    frequencies["sun"] += 1
                if sample[4] == "rain":
                    frequencies["rain"] += 1
                if sample[4] == "snow":
                    frequencies["snow"] += 1
        # sort by dictionary value
        ratings = sorted(frequencies, key = frequencies.get) 
        predictionRule[value] = ratings[2]
        errorCount[value] = frequencies[ratings[0]] + frequencies[ratings[1]]
    err = 0
    for v in errorCount.values():
        err += v
    return predictionRule, err

X = loadData(dataFile)
rules = {}
errors = {}
for attribute in attributeNames:
    rule, error = createRule(attribute)
    print("Rule %s Error: %s" % (rule, error))
    rules[attribute] = rule
    errors[attribute] = error
# sort the (attribute, error) pairs by error; sorting the dictionary itself
# would compare the second letter of each attribute name instead
ranking = sorted(errors.items(), key = itemgetter(1))
bestAttribute = ranking[0][0]
print("\nResult of OneR Analysis:")
print("Best attribute: " + bestAttribute)
print("Rule: " + str(rules[bestAttribute]))

Result:
Considering attribute: 'Clouds' now...
Rule {'overcast': 'sun', 'clear': 'sun', 'cloudy': 'snow'} Error: 10
Considering attribute: 'Temperature' now...
Rule {'mild': 'sun', 'cold': 'snow', 'hot': 'rain'} Error: 7
Considering attribute: 'Humidity' now...
Rule {'normal': 'sun', 'high': 'rain', 'low': 'sun'} Error: 9
Considering attribute: 'Wind' now...
Rule {'strong': 'sun', 'weak': 'rain'} Error: 10

Result of OneR Analysis:
Best attribute: Temperature
Rule: {'mild': 'sun', 'cold': 'snow', 'hot': 'rain'}
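
As a cross-check of the error counts above, the whole analysis can also be sketched compactly with collections.Counter. This is an alternative formulation for Python 3, not the code of the original programs. The names COLUMNS, TRAINING and one_r are introduced here, and the training records are embedded directly (whitespace-separated) so that the sketch runs without weather.csv.

```python
from collections import Counter, defaultdict

COLUMNS = ["Clouds", "Temperature", "Humidity", "Wind", "Weather"]

# The 20 training records from the table above, one record per line
TRAINING = """
clear cold high strong snow
clear hot high weak rain
cloudy cold high strong snow
clear mild high strong sun
clear mild normal strong sun
cloudy cold normal weak snow
overcast cold low weak sun
cloudy mild high strong rain
cloudy cold low strong snow
clear mild low strong sun
clear hot low weak snow
clear mild low weak sun
overcast cold normal strong snow
cloudy cold high weak rain
overcast hot normal strong sun
cloudy hot normal weak rain
clear hot normal weak rain
cloudy mild normal weak sun
overcast mild high weak rain
overcast hot low strong sun
"""

def one_r(records, attribute_names, target):
    """Return (best attribute, its rule, error count of every attribute)."""
    rules, errors = {}, {}
    for name in attribute_names:
        # counts[attribute value][target value] = number of instances
        counts = defaultdict(Counter)
        for record in records:
            counts[record[name]][record[target]] += 1
        rules[name] = {value: c.most_common(1)[0][0]
                       for value, c in counts.items()}
        errors[name] = sum(sum(c.values()) - max(c.values())
                           for c in counts.values())
    best = min(errors, key=errors.get)
    return best, rules[best], errors

records = [dict(zip(COLUMNS, line.split()))
           for line in TRAINING.strip().splitlines()]
best, rule, errors = one_r(records, COLUMNS[:-1], "Weather")
print(errors)  # {'Clouds': 10, 'Temperature': 7, 'Humidity': 9, 'Wind': 10}
print(best, rule)
```

Unlike createRule, this variant discovers the attribute values and the weather values from the data, so the dictionary attributes is not needed.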

Application to the test set

Finally we check the prediction quality by applying the rule found by the OneR analysis to a test set from the file weathertest.csv. It contains only 6 instances and can be downloaded from here.

Program:

# OneRule4.py

from operator import itemgetter

dataFile = "weather.csv"
testFile = "weathertest.csv"

attributes = {"Clouds":["overcast", "cloudy", "clear"], 
              "Temperature":["cold", "mild", "hot"], 
              "Humidity":["low", "normal", "high"], 
              "Wind":["weak", "strong"]}
attributeNames = ["Clouds", "Temperature", "Humidity", "Wind"]
# targetValues: "sun", "rain", "snow"

def loadData(fileName):
    # returns a list of rows (lists of strings), or [] if the file cannot be opened
    try:
        fData = open(fileName, 'r')
    except IOError:
        return []
    out = []
    for line in fData:
        line = line.strip()  # remove line break and surrounding blanks
        if len(line) == 0:  # empty line
            continue
        out.append(line.split(","))
    fData.close()
    return out

def createRule(attribute):
    predictionRule = {}
    errorCount = {}
    for value in attributes[attribute]:
        frequencies = {"sun":0, "rain":0, "snow":0}
        k = X[0].index(attribute) # position of attribute
        for sample in X[1:]:
            if sample[k] == value:
                if sample[4] == "sun":
                    frequencies["sun"] += 1
                if sample[4] == "rain":
                    frequencies["rain"] += 1
                if sample[4] == "snow":
                    frequencies["snow"] += 1
        # sort by dictionary value
        ratings = sorted(frequencies, key = frequencies.get) 
        predictionRule[value] = ratings[2]
        errorCount[value] = frequencies[ratings[0]] + frequencies[ratings[1]]
    err = 0
    for v in errorCount.values():
        err += v
    return predictionRule, err

X = loadData(dataFile)
rules = {}
errors = {}
for attribute in attributeNames:
    rule, error = createRule(attribute)
    rules[attribute] = rule
    errors[attribute] = error
# sort the (attribute, error) pairs by error; sorting the dictionary itself
# would compare the second letter of each attribute name instead
ranking = sorted(errors.items(), key = itemgetter(1))
bestAttribute = ranking[0][0]
print("Result of OneR Analysis:")
print("Best attribute: " + bestAttribute)
print("Rule: " + str(rules[bestAttribute]))
print("Applying now on test set...")
# the loop below treats every row of the test file as a record, i.e. it
# assumes no header line and the same column order as in weather.csv
Y = loadData(testFile)
hit = 0
fail = 0
index = attributeNames.index(bestAttribute)
for sample in Y:
    value = sample[index]
    prediction = rules[bestAttribute][value]
    if prediction == sample[4]:
        hit += 1
    else:
        fail += 1
print("Success summary:")
percent = 100.0 * hit / (hit + fail)  # float division, so rounding works
print("%d hits --> %d percent." % (hit, int(percent + 0.5)))

Remarks:
The result depends strongly on the data in the training and test sets, which were chosen rather arbitrarily.
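
To see how the hit rate reacts to different test data, the evaluation step can be isolated in a small function. The sketch below is written for Python 3 and uses the temperature rule found by the OneR analysis. The four test samples are invented for illustration and are not the contents of weathertest.csv. The function name evaluate and its default parameter (the prediction used for attribute values that do not occur in the rule) are likewise assumptions of this sketch.

```python
def evaluate(rule, samples, default=None):
    """Apply rule to (attribute value, actual weather) pairs.

    Returns the number of hits and the hit rate in percent.
    """
    hits = 0
    for value, actual in samples:
        if rule.get(value, default) == actual:
            hits += 1
    return hits, 100.0 * hits / len(samples)

# Rule for the attribute Temperature found by the OneR analysis
rule = {"cold": "snow", "mild": "sun", "hot": "rain"}

# Hypothetical test samples, invented for this example
samples = [("cold", "snow"), ("mild", "sun"), ("hot", "sun"), ("mild", "rain")]

hits, percent = evaluate(rule, samples)
print("%d hits --> %d percent." % (hits, int(percent + 0.5)))  # 2 hits --> 50 percent.
```

Swapping in other samples quickly shows how sensitive the percentage is when the test set contains only a handful of instances.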