Stage 1: Transit Detection. Is It Solved, How It Works, and How It Does on Our Data

The first half of PS7. Find which stars show a periodic dip above the noise, so the rest are discarded before classification. This note answers three things: is detection a solved problem (yes), what the method and mechanism are (with citations), and how it actually performs on our real cross-survey data (measured, not quoted). Companions: ps_7_explaination, ps7-how-it-works, ps7-methods-clean.

1. Short answer: Stage 1 is solved, do not over-invest

Detecting a periodic transit above an SNR threshold is a mature, off-the-shelf problem with a 20-year-old standard tool and a numeric detection convention. We use it as a library call and spend our real effort on Stage 2 (classification), which is the unsolved, prize-deciding part. The two standard methods:

BLS, Box Least Squares (Kovacs, Zucker and Mazeh 2002, A&A 391, 369). Slides a rectangular dip template across thousands of trial periods and scores each one. The long-standing workhorse.
TLS, Transit Least Squares (Hippke and Heller 2019, A&A 623, A39). Same search, but fits a realistic limb-darkened transit shape with ingress and egress instead of a box. It yields about 10 percent higher detection efficiency for small planets at the same false-alarm rate. TLS is our reference method; for the full 8,000-star pass we use its faster predecessor BLS (see section 4), since the two give the same conclusion and BLS is roughly 100x faster.

The standard detection cut is SDE (Signal Detection Efficiency) of about 7 to 9. The Kepler and TESS production pipelines use an equivalent matched-filter statistic, MES of 7.1 sigma (Jenkins et al.). So "is a transit present above threshold" has a settled, citable numeric standard.

2. The conceptual point: Stage 1 runs on the flux, not on per-star features

This is the most common confusion, so it is worth stating plainly.

Stage 1 is not a trained classifier, and it does not use a per-star feature table. It is a deterministic signal-processing search that runs directly on each star's raw flux time series, one star at a time, independently.

There is no training and no labels involved in detection. Nothing is fit to a dataset.
The "feature engineering" that happens is automatic and internal: the period search transforms the 1-D flux time series into a periodogram, and the single number you keep is the height of its strongest peak, the SDE. That periodogram is the feature extraction.
You then threshold that one number. SDE above the cut means a transit is present, keep the star. Below the cut means discard it.

The hand-built per-star features you might be thinking of, such as boxiness, secondary-eclipse depth, odd-even difference, and centroid shift, belong to Stage 2. They describe the shape of a dip that Stage 1 has already found. The two stages must not be confused.

So Stage 1, in one line: for each of the N stars, load raw flux, detrend, run TLS, read off the SDE and the best period, keep the stars above threshold.

3. The mechanism, step by step

3.1 Detrend

Remove slow instrumental and stellar trends (spacecraft thermal ramps, stellar rotation) so that only short-duration dips survive. We use the Tukey biweight time-windowed slider, with a window about 3 times the maximum transit duration, validated as the best general-purpose detrender in the wotan benchmark (Hippke et al. 2019, AJ 158, 143). For this evaluation we used the equivalent Savitzky-Golay flatten in lightkurve for speed.

3.2 Phase-folding and the periodogram

A single transit is faint, but it repeats. If you fold the time series at the correct period, every transit lands at the same phase and stacks into one clear dip. At a wrong period the dips scatter and wash out. BLS and TLS automate this over a dense grid of trial periods and score each fold. The detection statistic for a periodic dip is captured by its signal-to-noise:

$$\text{SNR} \;\approx\; \frac{\delta}{\sigma}\,\sqrt{N_{\text{tr}}\,n_{\text{in}}}$$

where $\delta$ is the dip depth, $\sigma$ the per-point scatter, $N_{\text{tr}}$ the number of transits observed, and $n_{\text{in}}$ the points per transit. The crucial term is $\sqrt{N_{\text{tr}}}$: more observed transits raise the detectability of the same planet. This is exactly why the number of quarters or sectors you stack matters, and it is the heart of our measured result below.

3.3 The SDE statistic and the threshold

TLS reports the SDE, the strength of the best periodogram peak relative to the spread of the rest of the periodogram. A real periodic dip produces a tall, isolated peak and a high SDE. Pure noise produces no clean peak and a low SDE. You accept detections at SDE above roughly 7 to 9.

flowchart LR
    A["raw flux f(t), one star"] --> B["detrend (biweight or SG)"]
    B --> C["BLS / TLS period search: fold at thousands of P"]
    C --> D["periodogram, peak height is SDE"]
    D -->|"SDE above cut"| E["transit present: keep for Stage 2"]
    D -->|"SDE below cut"| F["no detectable transit: discard"]

4. What we ran on our data (all 8,036 light curves)

A genuine period-blind test on the entire cross-survey dataset, not a sample.

Input: raw light curves read straight from the local FITS cache, no network. All 8,036 stars in the built dataset that have cached light curves (planet, eclipsing_binary, false_positive, non_transiting across Kepler, K2, TESS).
Process: detrend with a Savitzky-Golay flatten, then run BLS (Box Least Squares, Kovacs 2002) on each star over trial periods of 0.5 to 20 days, restricted to at least two cycles within the observed baseline. Record SDE, best period, depth, and duration.
Speed: a process pool across stars (8 cores, one star per core), direct FITS reads, and polars for the wrangling brought it to about 0.047 seconds per star, roughly 21 stars per second, so the full run finished in about 6 minutes. We first validated the same pipeline with the more sensitive but ~100x slower TLS on a balanced 500-star sample; the conclusions match, so the full run uses BLS.
The detection question: can the SDE threshold separate the three transit-bearing classes (planet, eclipsing_binary, false_positive all contain a real periodic dip) from non_transiting (random non-KOI stars with no known transit)? Positive = real dip present. Negative = non_transiting.
Train, validation, test: the detector has no learned parameters, so the only thing calibrated is the SDE decision threshold. We chose it on the train split by Youden's J (60 percent), confirmed on validation (20 percent), and report on the held-out test split (20 percent).

The result in one picture: SDE by class

Each class has a different SDE distribution. Eclipsing binaries sit far to the right (deep eclipses are trivially detectable). Non-transiting stars cluster low. Planets and non-eclipsing false positives sit in between, near the cut, because our light curves are shallow and single-quarter.

<div style="font-family:system-ui;color:var(--color-text)">
  <canvas id=c></canvas>
  <div id=cap style="margin-top:8px;color:var(--color-text-secondary);font-size:12px"></div>
</div>
<script>
const edges=[0,2,4,6,8,10,12,14,16,18,20,100];
const labels=edges.slice(0,-1).map((e,i)=> i<edges.length-2 ? e+'-'+edges[i+1] : '20+');
const hist={
 non_transiting:[0,386,1349,214,45,6,7,13,4,7,7],
 false_positive:[0,314,856,263,130,121,103,96,74,61,207],
 planet:[0,212,923,292,143,125,140,131,93,49,70],
 eclipsing_binary:[0,41,115,90,108,187,302,352,211,122,67]};
const col={non_transiting:'#888',false_positive:'#d9534f',planet:'#5cb85c',eclipsing_binary:'#f0ad4e'};
const med={non_transiting:4.8,false_positive:5.8,planet:5.8,eclipsing_binary:13.7};
const cs=getComputedStyle(document.body), txt=cs.getPropertyValue('--color-text')||'#ddd', sec=cs.getPropertyValue('--color-text-secondary')||'#999', grid=cs.getPropertyValue('--color-border')||'#444';
new Chart(document.getElementById('c'),{type:'bar',
 data:{labels:labels,datasets:Object.keys(hist).map(k=>({label:k+' (median SDE '+med[k]+')',data:hist[k],backgroundColor:col[k]}))},
 options:{scales:{
   x:{stacked:false,title:{display:true,text:'BLS Signal Detection Efficiency (SDE).  Detection cut is about 6.6.',color:txt},grid:{color:grid},ticks:{color:sec,maxRotation:0,autoSkip:false}},
   y:{title:{display:true,text:'number of stars',color:txt},grid:{color:grid},ticks:{color:sec}}},
  plugins:{legend:{labels:{color:txt}}}}
});
document.getElementById('cap').textContent='500 stars, 125 per class, real TLS searches on raw light curves. Eclipsing binaries (orange) are easy: deep eclipses push SDE well past the cut. Non-transiting stars (grey) stay low. Shallow single-quarter planets (green) and non-eclipsing false positives (red) crowd the 6 to 10 region right at the threshold, which is where detection becomes hard.';
</script>

5. Measured performance (full data, held-out test set)

All 8,036 stars, held-out test split of 1,582 stars.

Metric	Value	Notes
ROC-AUC	0.72	ranking quality of SDE, transit vs no-transit
PR-AUC	0.89	precision-recall area, the imbalance-robust number
Precision at SDE >= 6.6	0.94	of stars we flag, fraction that truly have a dip
Recall at SDE >= 6.6	0.53	of true-dip stars, fraction we detect
Specificity at SDE >= 6.6	0.90	of non-transiting stars, fraction correctly rejected
Threshold chosen on train (Youden J)	6.6	just below the literature 7 to 9 band

SDE medians by class: non_transiting 4.8, false_positive 5.8, planet 5.8, eclipsing_binary 13.7.

Period recovery against published catalog periods (within 2 percent, counting 2x and 0.5x aliases), n = 4,738 real-transit stars:

Class	Period recovered
eclipsing_binary	81.4%
false_positive	42.2%
planet	37.7%
All real-transit	42.9%
Real-transit stars that were detected (SDE >= 6.6)	82.6%

For reference, the slower, more sensitive TLS on a balanced 500-star sample gave the same shape of result at slightly higher absolute numbers (test ROC-AUC 0.81, PR-AUC 0.93, recall 0.68 at SDE >= 7), consistent with the known ~10 percent TLS edge over BLS.

6. How to read these numbers honestly

The detection statistic separates the classes well, and it is high precision: PR-AUC 0.89 on 8,036 stars means the SDE ranks real-dip stars far above noise, and precision is 0.94, so when BLS flags a star it almost always has a genuine periodic dip. Specificity is 0.90, so non-transiting stars are correctly rejected. Eclipsing binaries are detected almost trivially (median SDE 13.7 versus a cut of 6.6).

The recall of 0.53 is the honest weak number, and the reason is data, not method. Two facts explain it and both point to the same cause:

We deliberately cached only one quarter or sector per star to fan out across three surveys quickly. That gives few observed transits, so the $\sqrt{N_{\text{tr}}}$ term in the SNR is small and shallow planets sit right at the noise floor. With full multi-quarter stacks the same planets climb well above the cut. The published TLS recovery on complete Kepler light curves is about 93 percent (Hippke and Heller 2019), versus our single-quarter 53 percent.
Period recovery jumps from 43 percent overall to 82.6 percent once we restrict to stars that were actually detected, and is 81 percent for eclipsing binaries. The misses are dominated by shallow or long-period signals that a single quarter simply cannot show enough times. In other words, when we detect a star at all, we get its period right four times out of five.

So this evaluation does two useful things for the proposal at once. It confirms Stage 1 is the easy, solved half (a fixed matched filter with a standard threshold gives PR-AUC 0.89 and precision 0.94 with zero training, on the full 8,000-star set), and it quantifies the single biggest data lever we have: stacking more quarters and sectors per target would lift recall toward the literature ceiling. That is a concrete, defensible roadmap item rather than a hand-wave.

7. Conclusion and what it means for our effort

Stage 1 detection is off-the-shelf (TLS, SDE cut at 7 to 9). We treat it as a solved library call and do not try to innovate there.
It runs directly on each raw flux time series, with the periodogram peak as the only statistic. No training, no per-star feature table. Those belong to Stage 2.
On all 8,036 stars of our real cross-survey data it reaches PR-AUC 0.89 and precision 0.94 with a threshold in the textbook band, and the modest recall (0.53) is fully explained by our single-quarter data choice, which is a fixable data limitation, not a method failure.
Our competitive effort goes to Stage 2, classifying the detected dips into planet, eclipsing binary, blend, and other. That is the unsolved part and where PS7 is won.

Static version of the full-data SDE figure is saved at figures/stage1_sde_hist_full.png in the repo (500-star TLS version at figures/stage1_sde_hist.png).

Back to ps_7_explaination and ps7-methods-clean.