Classifying Kepler Objects of Interest with Logistic Regression


Inspiration

Question

Requirements

  • pandas
  • numpy
  • sklearn
  • matplotlib

The Code

A Quick Look at the Data

import pandas as pd

koi_df = pd.read_csv("cumulative.csv")
print(koi_df.head())
  • For brevity's sake, I have omitted the print output here; follow along in the notebook.
koi_duration: duration of transit
koi_depth: stellar flux lost during transit
koi_period: the interval between consecutive planetary transits
  • CONFIRMED
  • CANDIDATE
  • FALSE POSITIVE
len_df = len(koi_df)
len_con = len(koi_df[koi_df['koi_disposition'] == 'CONFIRMED'])
len_can = len(koi_df[koi_df['koi_disposition'] == 'CANDIDATE'])
len_fal = len(koi_df[koi_df['koi_disposition'] == 'FALSE POSITIVE'])
# Total Length
print(f"Number of entries: {len_df}")
> Number of entries: 9564
# Confirmed
print(f"Number of confirmed: {len_con}")
> Number of confirmed: 2293
# Candidate
print(f"Number of candidates: {len_can}")
> Number of candidates: 2248
# False Positives (Not Exoplanets)
print(f"Number of false positives: {len_fal}")
> Number of false positives: 5023
  • See here for more information on the data columns
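The three filtered `len()` checks above can be collapsed into a single call. A minimal sketch on stand-in rows (a toy frame, not the real `cumulative.csv` table):

```python
import pandas as pd

# Stand-in rows; the real table comes from cumulative.csv
demo_df = pd.DataFrame({"koi_disposition": [
    "CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE"]})

# One value_counts() call gives the count for every disposition at once
counts = demo_df["koi_disposition"].value_counts()
print(counts)
```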

Cleanup and Feature Extraction

# Flags:
# koi_fpflag_nt = 1 (Not Transit-Like Flag)
# koi_fpflag_ss = 1 (Stellar Eclipse Flag)
# koi_fpflag_co = 1 (Centroid Offset Flag)
# koi_fpflag_ec = 1 (Ephemeris Match Indicates Contamination Flag)
# Integer Encode
koi_df.loc[koi_df.koi_disposition == 'CONFIRMED', 'category'] = 0
koi_df.loc[koi_df.koi_disposition == 'CANDIDATE', 'category'] = 1
koi_df.loc[koi_df.koi_disposition == 'FALSE POSITIVE', 'category'] = 2
# Force null uncertainty to 0
koi_df['koi_duration_err1'] = koi_df['koi_duration_err1'].fillna(0)
koi_df['koi_depth_err1'] = koi_df['koi_depth_err1'].fillna(0)
koi_df['koi_period_err1'] = koi_df['koi_period_err1'].fillna(0)
# Depth can have null values; let's remove those rows.
koi_df.dropna(subset = ["koi_depth"], inplace=True)
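The three `.loc` assignments above amount to a label mapping; an equivalent sketch on stand-in rows (a toy frame, not the real KOI table):

```python
import pandas as pd

demo = pd.DataFrame({"koi_disposition": ["CONFIRMED", "CANDIDATE",
                                         "FALSE POSITIVE"]})

# Same integer encoding as the three .loc assignments, in one step
label_map = {"CONFIRMED": 0, "CANDIDATE": 1, "FALSE POSITIVE": 2}
demo["category"] = demo["koi_disposition"].map(label_map)
print(demo["category"].tolist())  # [0, 1, 2]
```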
'''
Using:
kepid: unique id (Display only),
kepler_name: name (Display only),
koi_disposition: category (label),
koi_duration: duration of transit (feature),
koi_duration_err1: duration uncertainty (feature)
koi_depth: stellar flux lost during transit (feature)
koi_depth_err1: depth uncertainty (feature)
koi_period: the interval between consecutive planetary transits (feature),
koi_period_err1: period uncertainty (feature),
koi_prad: planetary radius (feature)
'''
koi_df = koi_df[['kepid',
'kepler_name',
'koi_disposition',
'koi_duration',
'koi_duration_err1',
'koi_depth',
'koi_depth_err1',
'koi_period',
'koi_period_err1',
'koi_prad',
'koi_fpflag_nt',
'koi_fpflag_ss',
'koi_fpflag_co',
'category']]
# Correlation (numeric columns only; koi_df already holds just the columns above)
print(koi_df.corr(numeric_only=True))
  • Once again, for brevity's sake, I have omitted the print output here; follow along in the notebook.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# scale
koi_df[['koi_depth', 'koi_duration', 'koi_prad']] = scaler.fit_transform(
    koi_df[['koi_depth', 'koi_duration', 'koi_prad']])
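As a sanity check on what StandardScaler does: each scaled column ends up with zero mean and unit variance. A toy sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(X)

# Mean ~0 and (population) standard deviation ~1 after scaling
print(scaled.mean(), scaled.std())
```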
koi_df = koi_df.reset_index(drop=True)
# Train/test split (roughly 80/20; note the [7357:] slice skips row 7356)
features = ['koi_duration',
            'koi_duration_err1',
            'koi_depth',
            'koi_depth_err1',
            'koi_prad',
            'koi_fpflag_nt',
            'koi_fpflag_ss',
            'koi_fpflag_co']
x_train = koi_df[features][:7356]
x_test = koi_df[features][7357:]
y_train = koi_df[['category']][:7356]
y_test = koi_df[['category']][7357:]
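The positional slice above keeps the catalog's row order, so any ordering in the file leaks into the split. scikit-learn's train_test_split shuffles the rows and can stratify by class; a sketch on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out ~20% for testing; stratify keeps the class ratios balanced
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```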

The Model

from sklearn.linear_model import LogisticRegression

# Build (regularized by default)
regressor = LogisticRegression(max_iter=1000)
history = regressor.fit(x_train, y_train.values.ravel())
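After fitting, the model exposes its learned weights through coef_ and intercept_; for our three classes scikit-learn stores one weight row per class. A toy binary sketch (stand-in data, not the KOI features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_.shape)       # binary case: one weight row, (1, 1)
print(clf.predict([[2.5]]))  # a point on the class-1 side
```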

Prediction

from sklearn.metrics import confusion_matrix
import numpy as np
predictions = regressor.predict(x_test)
pred_df = pd.DataFrame(predictions, columns=["y_hat"])
pred_df['y'] = y_test.values
pred_df.to_csv("preds.csv", index=False)
# Accuracy
correct = len(pred_df.loc[pred_df['y'] == pred_df['y_hat']])
print(f"Correct Predictions: {correct}")
accuracy = (correct / len(pred_df)) * 100
print(f"Accuracy: {accuracy} %")
> Correct Predictions: 1679
> Accuracy: 91.05206073752711 %
# Alternatively...
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
# Confusion Matrix (rows: true class, columns: predicted class)
confusion_matrix(y_test, predictions, labels=[0, 1, 2])
> array([[  24,   11,    3],
         [  83,  321,   13],
         [   9,   46, 1334]])
  • 24 correct Confirmed
  • 321 correct Candidates
  • 1334 correct False Positives
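The diagonal shows the Confirmed class is recovered far less reliably than False Positives, which a plain accuracy number hides; per-class precision and recall make this explicit. A sketch with classification_report on toy labels:

```python
from sklearn.metrics import classification_report

# Toy labels standing in for the three KOI categories (0/1/2)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, labels=[0, 1, 2]))
```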

Next Steps
