Classifying Kepler Objects of Interest with Logistic Regression

Andrew Bossie
5 min read · Jul 5, 2021


Photo by Sven Scheuermeier on Unsplash


From 2009 to 2016, NASA’s Kepler Mission set out to determine whether objects of interest orbiting candidate stars could be classified as exoplanets. I became interested in this mission years ago, going so far as to help out with the Planet Hunters project. Years later, while poking around Kaggle for interesting datasets, I stumbled upon Kepler Mission data once again.


Can we use existing data from Kepler’s exoplanet survey to predict the classification of an object of interest? Moreover, can we build that model without using neural networks?


The best part of this project is that we use Python 3 and only need a few additional libraries:

  • pandas
  • sklearn
  • matplotlib

The Code

All the code for this can be found on my GitHub here. Everything is laid out in a commented Jupyter notebook. Following along is strongly encouraged.

A Quick Look at the Data

Let’s import the CSV and take a look…

import pandas as pd

koi_df = pd.read_csv("cumulative.csv")
  • * For brevity’s sake I have omitted the printed output here; follow along in the notebook.

It’s important to understand what’s included in this dataset. The Kepler Mission studied variations in the brightness of the stars in question. As an object of interest transits in front of a star, the star’s light dims by a certain amount for a certain length of time. Potential planets also transit at a regularly repeating period.

Our dataset includes fields that describe these observations:

koi_duration: duration of transit
koi_depth: stellar flux lost during transit
koi_period: interval between consecutive planetary transits

These observation fields are the inputs we will use to predict koi_disposition, which is the classification label for each row of data. The label takes the following values, which we will use during training and testing:

len_df = len(koi_df)
len_con = len(koi_df[koi_df['koi_disposition'] == 'CONFIRMED'])
len_can = len(koi_df[koi_df['koi_disposition'] == 'CANDIDATE'])
len_fal = len(koi_df[koi_df['koi_disposition'] == 'FALSE POSITIVE'])
# Total Length
print(f"Number of entries: {len_df}")
> Number of entries: 9564
# Confirmed
print(f"Number of confirmed: {len_con}")
> Number of confirmed: 2293
# Candidate
print(f"Number of candidates: {len_can}")
> Number of candidates: 2248
# False Positives (Not Exoplanet)
print(f"Number of false positives: {len_fal}")
> Number of false positives: 5023
  • See here for more information on the data columns
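
The per-label counts above can also be obtained in a single call with `value_counts()`. A minimal sketch on a tiny made-up frame (the rows below are invented for illustration and are not taken from cumulative.csv):

```python
import pandas as pd

# Toy stand-in for koi_df; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "koi_disposition": [
        "CONFIRMED", "CANDIDATE", "FALSE POSITIVE",
        "FALSE POSITIVE", "CONFIRMED", "FALSE POSITIVE",
    ]
})

# Absolute number of rows per label
counts = toy_df["koi_disposition"].value_counts()

# Share of each label, handy for spotting class imbalance
shares = toy_df["koi_disposition"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

In the real dataset the same call would show the imbalance directly: 5023 of the 9564 rows are false positives, a bit more than half.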

Cleanup and Feature Extraction

Now that we have had a cursory glance at the data, let’s do some feature extraction…

# Flags:
# koi_fpflag_nt = 1 (Not Transit-Like Flag)
# koi_fpflag_ss = 1 (Stellar Eclipse Flag)
# koi_fpflag_co = 1 (Centroid Offset Flag)
# koi_fpflag_ec = 1 (Ephemeris Match Indicates Contamination Flag)
# Integer Encode
koi_df.loc[koi_df.koi_disposition == 'CONFIRMED', 'category'] = 0
koi_df.loc[koi_df.koi_disposition == 'CANDIDATE', 'category'] = 1
koi_df.loc[koi_df.koi_disposition == 'FALSE POSITIVE', 'category'] = 2
# Force null uncertainty to 0
koi_df['koi_duration_err1'] = koi_df['koi_duration_err1'].fillna(0)
koi_df['koi_depth_err1'] = koi_df['koi_depth_err1'].fillna(0)
koi_df['koi_period_err1'] = koi_df['koi_period_err1'].fillna(0)
# Depth can have null values. Let's remove those rows.
koi_df.dropna(subset = ["koi_depth"], inplace=True)
# kepid: unique id (Display only)
# kepler_name: name (Display only)
# koi_disposition: category (label)
# koi_duration: duration of transit (feature)
# koi_duration_err1: duration uncertainty (feature)
# koi_depth: stellar flux lost during transit (feature)
# koi_depth_err1: depth uncertainty (feature)
# koi_period: interval between consecutive planetary transits (feature)
# koi_period_err1: period uncertainty (feature)
# prad: eccentricity value (feature)
koi_df = koi_df[['kepid',

The first order of business is to integer encode the label data (koi_disposition). The reasoning is that numeric labels are easier to work with than text in the steps that follow, such as the correlation check. We have three labels here, so we integer encode them as 0 to 2.
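
As an alternative to three separate `.loc` assignments, the same encoding can be written with a dictionary and `Series.map`. This is only a sketch on invented rows; `LABEL_TO_INT` and `toy_df` are names made up for the example:

```python
import pandas as pd

# Same label-to-integer scheme as in the article
LABEL_TO_INT = {"CONFIRMED": 0, "CANDIDATE": 1, "FALSE POSITIVE": 2}

# Toy rows, invented for illustration
toy_df = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CONFIRMED"]
})

# map() yields NaN for any label missing from the dictionary,
# which makes unexpected values easy to detect.
toy_df["category"] = toy_df["koi_disposition"].map(LABEL_TO_INT)
assert toy_df["category"].notna().all(), "unmapped disposition found"

# Store the codes as integers rather than floats
toy_df["category"] = toy_df["category"].astype(int)

print(toy_df)
```

One practical difference: the dictionary keeps the whole mapping in one place, so it can be reused later to turn predicted integers back into readable labels.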

Next, I’ve pre-selected some fields that stood out to me as potentially important contributors to the labels. We will narrow these down later.
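
The selection and null-handling steps can be sketched together on a small invented frame. The column names follow the field list above; `unused_column`, `feature_cols` and `err_cols` are hypothetical names introduced only for this example:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for cumulative.csv; all values are invented.
toy_df = pd.DataFrame({
    "kepid": [1, 2, 3],
    "koi_disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE"],
    "koi_duration": [2.9, 4.5, 1.7],
    "koi_duration_err1": [0.08, 0.12, 0.05],
    "koi_depth": [615.8, np.nan, 10829.0],
    "koi_depth_err1": [19.5, np.nan, np.nan],
    "unused_column": ["a", "b", "c"],
})

feature_cols = ["koi_duration", "koi_duration_err1", "koi_depth", "koi_depth_err1"]
err_cols = [c for c in feature_cols if c.endswith("_err1")]

# Keep only the id, label and feature columns
selected = toy_df[["kepid", "koi_disposition"] + feature_cols].copy()

# Missing uncertainties become 0, as in the article
selected[err_cols] = selected[err_cols].fillna(0)

# Rows without a depth measurement are dropped
selected = selected.dropna(subset=["koi_depth"])

print(selected)
```

Taking a `.copy()` of the selection avoids pandas warnings about modifying a view when the columns are edited afterwards.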

Let’s take the cleaned dataframe and look at a correlation graph to see if any field stands out above the others.

# Correlation
  • * Once again, for brevity’s sake I have omitted the output here; follow along in the notebook.

Here we can see that our ‘period’ fields have a low correlation with koi_disposition and thus are unlikely to matter much in determining the label. We will go ahead and remove them from our dataset.
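
The correlation code itself is not shown here, so the following is only a sketch of the idea on synthetic data, not the notebook's actual code. The `signal` and `noise` columns and the `THRESHOLD` value are assumptions made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: 'signal' is built to track the label, 'noise' is not.
category = rng.integers(0, 3, size=n)
toy_df = pd.DataFrame({
    "category": category,
    "signal": category + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Pearson correlation of every feature with the encoded label
corr_with_label = toy_df.corr()["category"].drop("category")
print(corr_with_label.abs().sort_values(ascending=False))

# Drop features whose absolute correlation falls below a cutoff
THRESHOLD = 0.2  # illustrative value, not from the article
weak = corr_with_label[corr_with_label.abs() < THRESHOLD].index.tolist()
reduced = toy_df.drop(columns=weak)

print(f"Dropped: {weak}")
```

Pearson correlation against an integer-coded label is a rough heuristic, because the 0/1/2 ordering of the categories is arbitrary. It is a quick first filter rather than a definitive measure of feature importance.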

Let’s standardize our data so that every feature is on the same scale. This is an important step because our model is regularized. Regularization is the process of placing a penalty on learning that limits a model’s ability to overfit the data. The penalty acts on the model’s coefficients, whose sizes depend on the units of each feature, so without standardizing our data the penalty would weigh on some features far more than others.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# scale
] = scaler.fit_transform(koi_df[['koi_depth',
koi_df = koi_df.reset_index()
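
The scaling snippet above is cut off, so here is a minimal, self-contained sketch of the same step on invented values. The column list `feature_cols` is an assumption for the example, not the article's exact list:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values on very different scales
toy_df = pd.DataFrame({
    "koi_duration": [1.5, 3.0, 4.5, 6.0],
    "koi_depth": [100.0, 500.0, 900.0, 12000.0],
})

feature_cols = ["koi_duration", "koi_depth"]

# fit_transform learns each column's mean and standard deviation,
# then rescales the column to mean 0 and unit variance.
scaler = StandardScaler()
toy_df[feature_cols] = scaler.fit_transform(toy_df[feature_cols])

print(toy_df.mean().round(6))
print(toy_df.std(ddof=0).round(6))
```

A common refinement is to fit the scaler on the training rows only and then apply `scaler.transform` to the test rows, so that no statistics from the test set leak into training.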

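The next step is to split our data into training and testing sets. Here we slice by row position: the first 7,356 rows go to training and the rows from index 7357 onward go to testing, which works out to roughly an 80/20 split. You can adjust if need be.

# Positional split (roughly 80/20)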
x_train = koi_df[['koi_duration',
x_test = koi_df[['koi_duration',
y_train = koi_df[['category']][:7356]
y_test = koi_df[['category']][7357:]
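
Slicing by row position keeps the file's original ordering, so if the CSV happens to be sorted in some way the class mix can differ between the two sets (the two slices above also leave out row 7356). One alternative, shown here as a sketch on synthetic data rather than as the article's method, is sklearn's `train_test_split` with shuffling and stratification. The `test_size=0.2` value is an assumption chosen to approximate the slice indices above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))      # synthetic features
y = rng.integers(0, 3, size=300)   # synthetic labels 0, 1, 2

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # assumed proportion, close to the positional slices
    shuffle=True,     # randomize row order before splitting
    stratify=y,       # keep the label proportions similar in both sets
    random_state=0,   # make the split reproducible
)

print(x_train.shape, x_test.shape)
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```

With `stratify=y`, the two printed label distributions should be close to each other, which makes the test accuracy easier to interpret.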

The Model

For this project we will be making use of sklearn’s LogisticRegression model. It applies an L2 penalty by default, hence our standardization earlier. This model is quite fast, easy to use, and will allow us to classify into discrete values (in this case the three categories).

Let’s import and build the model:

from sklearn.linear_model import LogisticRegression
# Build (regularized by default)
regressor = LogisticRegression(max_iter=1000)

We can fit our training data to the model with:

history = regressor.fit(x_train, y_train.values.ravel())

** ravel() is used here to flatten the dataframe column into a 1-D array
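
For a version that can be run end to end, the sketch below chains scaling and logistic regression in an sklearn pipeline and fits it on synthetic three-class data. The data from `make_classification` is a stand-in for the Kepler features, and wrapping the steps in a pipeline is a design choice of this example, not something the notebook does:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the Kepler data
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training rows only,
# then feeds the scaled features to the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)

print(f"Test accuracy: {model.score(x_test, y_test):.3f}")

# Class probabilities for the first three test rows (each row sums to 1)
probabilities = model.predict_proba(x_test[:3])
print(probabilities.round(3))
```

Because `y` is already a 1-D array here, no `ravel()` is needed; it only becomes necessary when the labels come from a single-column dataframe.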


Now that we have trained our model, we can make some predictions and score them against our test data.

Let’s generate predictions for the test set and store them in a dataframe next to the true labels…

from sklearn.metrics import confusion_matrix
import numpy as np
predictions = regressor.predict(x_test)
pred_df = pd.DataFrame(predictions, columns=["y_hat"])
pred_df['y'] = y_test.values
pred_df.to_csv("preds.csv", index=False)

Now that we have our predictions let’s find the model accuracy along with a confusion matrix…

# Accuracy
correct = len(pred_df.loc[pred_df['y'] == pred_df['y_hat']])
print(f"Correct Predictions: {correct}")
accuracy = (correct / len(pred_df)) * 100
print(f"Accuracy: {accuracy} %")
>>Correct Predictions: 1679
>>Accuracy: 91.05206073752711 %
# Alternatively...
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
# Confusion Matrix
# Note: predictions are passed first, so rows are predicted labels and columns are true labels
confusion_matrix(predictions, y_test, labels=[0, 1, 2])
>>array([[  24,   83,    9],
         [  11,  321,   46],
         [   3,   13, 1334]])

As you can see, we achieved an out-of-sample accuracy of ~91%. From our confusion matrix we classified:

  • 24 correct Confirmed
  • 321 correct Candidates
  • 1334 correct False Positives
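
Overall accuracy hides how each class fares, so it can help to derive per-class precision and recall from the matrix. The sketch below uses the numbers printed above. Because the call was `confusion_matrix(predictions, y_test)`, rows are read as predicted labels and columns as true labels:

```python
import numpy as np

# Confusion matrix as printed above:
# rows = predicted label, columns = true label
cm = np.array([
    [24, 83, 9],
    [11, 321, 46],
    [3, 13, 1334],
])
labels = ["Confirmed", "Candidate", "False Positive"]

correct = np.diag(cm)
precision = correct / cm.sum(axis=1)  # correct / everything predicted as the class
recall = correct / cm.sum(axis=0)     # correct / everything truly in the class
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>14}: precision={p:.3f} recall={r:.3f}")
print(f"Overall accuracy: {accuracy:.4f}")
```

Only 24 of the 116 rows predicted as Confirmed were actually Confirmed, so the ~91% figure is driven mostly by the large False Positive class.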

Next Steps

As a quick and dirty method this works fairly well, but there is always more to do here. Going forward, I would recommend further feature extraction as well as trying more than one model.
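
As one possible way to try more than one model, the sketch below compares logistic regression with a random forest using 5-fold cross-validation. Both the synthetic data and the choice of random forest as the second model are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the Kepler data
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Five accuracy scores per model, one for each held-out fold
scores = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    for name, estimator in candidates.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(+/- {fold_scores.std():.3f})")
```

Cross-validation gives each model several train/test splits instead of one, which makes the comparison less dependent on how a single split happened to fall.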