Reliability and Validity of Smartphone Cognitive Testing for Frontotemporal Lobar Degeneration

Key Points
Question Can remote cognitive testing via smartphones yield reliable and valid data for frontotemporal lobar degeneration (FTLD)?
Findings In this cohort study of 360 patients, remotely deployed smartphone cognitive tests showed moderate to excellent reliability compared with criterion standard measures (in-person disease severity assessments and neuropsychological tests) and brain volumes. Smartphone tests accurately detected dementia and were more sensitive to the earliest stages of familial FTLD than standard neuropsychological tests.
Meaning These findings suggest that remotely deployed smartphone-based assessments may be reliable and valid tools for evaluating FTLD and may enhance early detection, supporting the inclusion of digital assessments in clinical trials for neurodegeneration.


Introduction
Frontotemporal lobar degeneration (FTLD) is a neurodegenerative pathology causing early-onset dementia syndromes with impaired behavior, cognition, language, and/or motor functioning. 1 Although over 30 FTLD trials are planned or in progress, there are several barriers to conducting FTLD trials. Clinical trials for neurodegenerative disease are expensive, 2 and frequent in-person trial visits are burdensome for patients, caregivers, and clinicians, 3 a concern magnified in FTLD by behavioral and motor impairments. Given the rarity and geographical dispersion of eligible participants, FTLD trials require global recruitment, 4 particularly for those that are far from expert FTLD clinical trial centers. Furthermore, criterion standard neuropsychological tests are not adequately sensitive until symptoms are already noticeable to families, limiting their usefulness as outcomes in early-stage FTLD treatment trials. 4
Reliable, valid, and scalable remote data collection methods may help surmount these barriers to FTLD clinical trials. Smartphones are garnering interest across neurological conditions as a method for administering remote cognitive and motor evaluations. Preliminary evidence supports the feasibility, reliability, and/or validity of unsupervised smartphone cognitive and motor testing in older adults at risk for Alzheimer disease, 5-8 Parkinson disease, 9 and Huntington disease. 10
The clinical heterogeneity of FTLD necessitates a uniquely comprehensive smartphone battery. In the ALLFTD Consortium (Advancing Research and Treatment in Frontotemporal Lobar Degeneration [ARTFL] and Longitudinal Evaluation of Familial Frontotemporal Dementia Subjects [LEFFTDS]), the ALLFTD mobile Application (ALLFTD-mApp) was designed to remotely monitor cognitive, behavioral, language, and motor functioning in FTLD research. Taylor et al 11 recently reported that unsupervised ALLFTD-mApp data collection through a multicenter North American FTLD research network was feasible and acceptable to participants. Herein, we extend that work by investigating the reliability and validity of unsupervised remote smartphone tests of executive functioning and memory in a cohort with FTLD that has undergone extensive phenotyping.

Participants
Participants were enrolled from ongoing FTLD studies requiring in-person assessment, including participants from 18 centers from the ALLFTD study 12 and University of California, San Francisco (UCSF) FTLD studies. To study the app in older individuals, a small group of older adults without functional impairment was recruited from the UCSF Brain Aging Network for Cognitive Health.
The baseline participation window was divided into three 25- to 35-minute assessment sessions occurring over 11 days. All cognitive tests were repeated in every session to enhance task reliability 6,13 and enable assessment of test-retest reliability, except for card sort, which was administered once every 6 months due to expected practice effects. Adherence was defined as the percentage of all available tasks that were completed. Participants were asked to complete the triplicate of sessions every 6 months for the duration of the app study. Only the baseline triplicate was analyzed in this study.
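The adherence definition above reduces to a single ratio. The following is a minimal sketch only; the function name and the task counts in the example are hypothetical and are not taken from the study.

```python
def adherence_percent(completed_tasks: int, available_tasks: int) -> float:
    """Percentage of all available tasks that were completed."""
    if available_tasks <= 0:
        raise ValueError("available_tasks must be positive")
    if not 0 <= completed_tasks <= available_tasks:
        raise ValueError("completed_tasks must be between 0 and available_tasks")
    return 100.0 * completed_tasks / available_tasks


# Hypothetical participant: 18 tasks offered across the 3 baseline sessions,
# 16 of them completed.
print(round(adherence_percent(16, 18), 1))  # 88.9
```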
Replicability was tested by dividing the sample into a discovery cohort (n = 258) comprising all participants enrolled until the initial data freeze (October 1, 2022) and a validation cohort (n = 102) comprising participants enrolled after October 1, 2022, and 18 pilot participants 11 who completed the first session in person with an examiner present during cognitive pretesting. Sensitivity analyses excluded this small pilot cohort.
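The cohort assignment rule described above can be sketched as follows. This is an illustration, not the study's code: the function name is hypothetical, and treating the data freeze date itself as belonging to the discovery cohort is an assumption.

```python
from datetime import date

# Initial data freeze reported in the text.
DATA_FREEZE = date(2022, 10, 1)


def assign_cohort(enrolled_on: date, is_pilot: bool = False) -> str:
    """Assign a participant to the discovery or validation cohort."""
    # Pilot participants were grouped with the validation cohort.
    if is_pilot:
        return "validation"
    # Assumption: enrollment on the freeze date counts as discovery.
    return "discovery" if enrolled_on <= DATA_FREEZE else "validation"


print(assign_cohort(date(2022, 3, 15)))                 # discovery
print(assign_cohort(date(2023, 1, 9)))                  # validation
print(assign_cohort(date(2021, 6, 1), is_pilot=True))   # validation
```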

ALLFTD Mobile App
ALLFTD investigators partnered with Datacubed Health

Clinical Assessment and Traditional Neuropsychological Measures
Criterion standard clinical data were collected during parent project visits. Syndromic diagnoses were made according to published criteria. 15-19 The CDR plus NACC-FTLD module is an 8-domain rating scale based on informant and participant report. 21 A global score was calculated to categorize disease severity as asymptomatic or preclinical if a pathogenic variant carrier (0), prodromal (0.5), or symptomatic (1.0-3.0). 22 A sum of the 8 domain box scores (CDR plus NACC-FTLD sum of boxes) was also calculated. 22
Participants completed the UDS Neuropsychological Battery, version 3.0 23 (eMethods in Supplement 1), which includes traditional neuropsychological measures and the Montreal Cognitive Assessment (MoCA), a global cognitive screen. Executive functioning and processing speed measures were summarized into a composite score (UDS3-EF). 24 Participants also completed a 9-item list-learning memory test (California Verbal Learning Test, 2nd edition, Short Form). 25 Most (339 [94.2%]) neuropsychological evaluations were conducted in person. In a subsample (n = 270), motor speed and dexterity were assessed using the Movement Disorder Society Uniform Parkinson Disease Rating Scale 26 Finger Tapping subscale (0 indicates no deficits [n = 240]).
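The severity categories and the sum of boxes described above can be expressed as a small lookup. This sketch covers only the mapping stated in the text; it does not implement the published algorithm that derives the global score from the domain ratings, and the function names are hypothetical.

```python
from typing import Sequence


def severity_category(global_score: float) -> str:
    """Map a CDR plus NACC-FTLD global score to the category used in the text."""
    if global_score == 0:
        return "asymptomatic or preclinical"
    if global_score == 0.5:
        return "prodromal"
    if 1.0 <= global_score <= 3.0:
        return "symptomatic"
    raise ValueError("unexpected global score: %r" % (global_score,))


def sum_of_boxes(box_scores: Sequence[float]) -> float:
    """Sum the 8 domain box scores."""
    if len(box_scores) != 8:
        raise ValueError("expected 8 domain box scores")
    return float(sum(box_scores))


print(severity_category(0.5))                       # prodromal
print(sum_of_boxes([0.5, 0.5, 0, 0, 1, 0, 0.5, 0]))  # 2.5
```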

Neuroimaging
We acquired T1-weighted brain magnetic resonance imaging for 199 participants. Details of image acquisition, harmonization, preprocessing, and processing are provided in eMethods in Supplement 1 and prior publications. 27 Briefly, SPM12 (Statistical Parametric Mapping) was used for segmentation 28 and Large Deformation Diffeomorphic Metric Mapping for generating group templates. 29 Gray matter volumes were calculated in template space by integrating voxels and dividing by total intracranial volume in 2 regions of interest (ROIs) 30: a frontoparietal and subcortical ROI and a hippocampal ROI. Voxel-based morphometry was used to test unbiased voxel-wise associations of volume with smartphone tests (eMethods in Supplement 1). 31,32
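As a rough illustration of the ROI volume calculation described above (integrating voxels within an ROI and dividing by total intracranial volume), consider the sketch below. It assumes the ROI voxels are available as a flat list of gray matter values and that a single voxel volume applies; the function name, inputs, and example numbers are hypothetical, and the actual pipeline is described in eMethods in Supplement 1.

```python
def roi_volume_fraction(voxel_values, voxel_volume_ml, tiv_ml):
    """Integrate gray matter values over an ROI and normalize by TIV.

    voxel_values: gray matter values for the voxels inside the ROI (assumed input).
    voxel_volume_ml: volume of one voxel in milliliters (assumed constant).
    tiv_ml: total intracranial volume in milliliters.
    """
    if tiv_ml <= 0:
        raise ValueError("tiv_ml must be positive")
    roi_volume_ml = sum(voxel_values) * voxel_volume_ml
    return roi_volume_ml / tiv_ml


# Hypothetical 3-voxel ROI with 1 mm^3 (0.001 mL) voxels and a TIV of 1400 mL.
fraction = roi_volume_fraction([0.5, 1.0, 0.5], 0.001, 1400.0)
print(fraction)
```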

Genetics
Participants in the ALLFTD study underwent genetic testing 33 at the University of California, Los Angeles. DNA samples were screened using targeted sequencing of a custom panel of genes previously implicated in neurodegenerative diseases, including GRN (138945) and MAPT (157140).
Hexanucleotide repeat expansions in C9orf72 (614260) were detected using both fluorescent and repeat-primed polymerase chain reaction analysis. 34

Statistical Analysis
Statistical analyses were conducted using Stata, version 17.0 (StataCorp LLC), and R, version 4.4.2 (R Project for Statistical Computing). All tests were 2 sided, with a statistical significance threshold of P < .05.
Psychometric properties of the smartphone tests were explored using descriptive statistics.
Comparisons between CDR plus NACC-FTLD groups (ie, asymptomatic or preclinical, prodromal, and symptomatic) for continuous variables, including demographic characteristics and cognitive task scores (first exposure to each measure), were analyzed by fitting linear regressions. We used χ² difference tests for frequency data (eg, sex and race and ethnicity).

Reliability
Internal consistency, which measures reliability within a task, was estimated for participants' first exposure to each test using Cronbach α (details in eMethods in Supplement 1). Test-retest reliability was estimated using intraclass correlation coefficients for participants who completed a task at least twice; all exposures were included. Reliability estimates are described as poor (<0.500), moderate (0.500-0.749), good (0.750-0.899), and excellent (≥0.900) 35; these are reporting rules of thumb, and clinical interpretation should consider raw estimates. We calculated 95% CIs via bootstrapping with 1000 samples.
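For readers less familiar with these estimates, the sketch below illustrates Cronbach α, a bootstrap 95% CI with 1000 resamples, and the descriptive labels listed above. It is not the study's analysis code (analyses were run in Stata and R). The percentile method for the CI, the resampling of participants as the unit, the function names, and the example scores are all assumptions made for illustration.

```python
import random
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach alpha; scores is a list of per-participant lists of item scores."""
    k = len(scores[0])
    if len(scores) < 2 or k < 2:
        raise ValueError("need at least 2 participants and 2 items")
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def bootstrap_ci(scores, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_boot):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(statistic(resample))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero total variance
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


def reliability_label(estimate):
    """Descriptive label using the thresholds given in the text."""
    if estimate < 0.5:
        return "poor"
    if estimate < 0.75:
        return "moderate"
    if estimate < 0.9:
        return "good"
    return "excellent"


# Hypothetical scores: 6 participants x 3 item blocks.
scores = [[3.0, 4.0, 3.5], [5.0, 5.5, 5.0], [2.0, 2.5, 3.0],
          [4.0, 4.5, 4.0], [6.0, 5.5, 6.5], [1.0, 2.0, 1.5]]
alpha = cronbach_alpha(scores)
print(reliability_label(alpha), bootstrap_ci(scores, cronbach_alpha))
```

The same `bootstrap_ci` helper accepts any statistic computed on a list of participant rows, so an intraclass correlation function could be passed in place of `cronbach_alpha`.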

Group Comparisons
To evaluate the app's ability to select participants with prodromal or symptomatic FTLD for trial enrollment, we tested discrimination of participants without symptoms from those with prodromal and symptomatic FTLD. To understand the app's utility for screening early cognitive impairment, we fit receiver operating characteristic curves testing the predictive value of the app composite, UDS3-EF, and MoCA for differentiating participants without symptoms and those with preclinical FTLD from those with prodromal FTLD; areas under the curve (AUCs) for the app and MoCA were compared using the DeLong test in participants with results for both predictive factors.
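The area under a receiver operating characteristic curve has a simple rank-based interpretation: the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one-half. The sketch below computes the AUC this way. It assumes higher values indicate greater impairment (a score for which lower is worse would be negated first), uses hypothetical data, and does not implement the DeLong comparison.

```python
def auc(case_scores, control_scores):
    """AUC as the proportion of case-control pairs ranked correctly.

    Assumes higher scores indicate greater likelihood of being a case.
    Ties contribute one-half.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical impairment scores for prodromal cases and unimpaired controls.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
print(round(auc(cases, controls), 3))  # 0.889
```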
We compared app performance in preclinical participants who carried pathogenic variants with that in noncarrier controls using linear regression adjusted for age (a predictive factor in earlier models). For this analysis, we excluded those younger than 45 years to remove participants likely to be years from symptom onset based on natural history studies. 4 We analyzed memory performance in participants who carried MAPT pathogenic variants, as early executive deficits may be less prominent. 34,36
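An age-adjusted group comparison of this kind amounts to regressing the test score on a carrier indicator and age. The sketch below fits that model by ordinary least squares with the standard library only; it is illustrative, not the study's code. The function names and the data are hypothetical, and the data were constructed so that the true coefficients are known.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (Gauss-Jordan solve)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def fit_age_adjusted(participants, min_age=45):
    """Fit score ~ intercept + carrier + age, excluding those younger than min_age.

    participants: iterable of (age, is_carrier, score) tuples (hypothetical layout).
    Returns [intercept, carrier_coefficient, age_coefficient].
    """
    kept = [p for p in participants if p[0] >= min_age]
    X = [[1.0, 1.0 if carrier else 0.0, float(age)] for age, carrier, _ in kept]
    y = [score for _, _, score in kept]
    return ols(X, y)


# Hypothetical data generated from score = 1.0 - 0.5 * carrier - 0.02 * age.
raw = [(46, 0), (52, 1), (60, 0), (67, 1), (71, 0), (55, 1), (40, 1)]
data = [(age, c, 1.0 - 0.5 * c - 0.02 * age) for age, c in raw]
intercept, carrier_effect, age_effect = fit_age_adjusted(data)
print(round(carrier_effect, 3), round(age_effect, 3))  # -0.5 -0.02
```

The carrier coefficient is the estimated difference between carriers and noncarriers at a given age; standard errors and P values are omitted from this sketch.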

Participant Characteristics
Of 1163 eligible participants, 360 were enrolled, 439 were excluded, and 364 refused to participate (additional details are provided in the eResults in Supplement 1). Participant characteristics are reported in Table 1 for the full sample. The discovery and validation cohorts did not significantly differ in terms of demographic characteristics, disease severity, or cognition (eTable 1 in Supplement 1). In the full sample, there were 209 women (58.1%) and 151 men (41.9%), and the mean (SD) age was 54.0 (15.4) years (range, 18-89 years). The mean (SD) educational level was 16.5 (2.3) years (range, 12-20 years). Among the 358 participants with racial and ethnic data available, 340 (95.0%) identified as White. For the 18 participants self-identifying as being of other race or ethnicity, the specific group was not provided to protect participant anonymity. Among the 329 participants with available CDR plus NACC-FTLD scores (  2, and eFigure 1 in Supplement 1). Go/no-go reliability was particularly poor in participants without symptoms (ICC, 0.10 [95% CI, −0.37 to 0.48]) and was removed from subsequent validation analyses except the correlation matrix (Figure 3A and B). The 95% CIs for reliability estimates overlapped in the discovery and validation cohorts (Figure 2).
Reliability estimates showed overlapping 95% CIs regardless of distractions (eFigure 2 in Supplement 1) or operating systems (eFigure 3 in Supplement 1), with a pattern of slightly lower reliability estimates when distractions were endorsed for all comparisons except Stroop (Cronbach α).

Construct Validity
In
c Owing to the small sample size, the specific groups are not provided to protect participant anonymity.
d Diagnostic categories were not compared across groups as diagnoses are designed to differ with increasing disease severity. Note that diagnoses such as psychiatric disorder, Parkinson disease, and Alzheimer disease syndrome are possible manifestations of FTLD; diagnoses of psychiatric disorders were documented in 2 noncarrier controls.
e Refers to identified rare pathogenic variants. The specific variation is not provided to protect participant anonymity.
f Defined as the percentage of all possible tasks that were completed. Eight sessions were removed due to software bugs that prevented participants from completing all tasks.
g Indicates compared with Android.
h Percentage of respondents reporting distractions at baseline of 282 who responded to survey, removing 3 data points reporting at least 8 of 11 distractions. More details are provided in eTables 3 and 4 in Supplement 1.
i Includes those who completed the survey (n = 326).
Associations with sex and educational level were not statistically significant.
Cognitive tests administered using the app showed evidence of convergent and divergent validity (eFigure 4 in Supplement 1), with very similar findings in discovery (Figure 3A) and validation cohorts (Figure 3B). App-based measures of executive functioning were generally correlated with criterion standard in-person measures of these domains and less with measures of other cognitive domains (r range, 0.40-0.66). For example, the flanker task was associated with the UDS3   Worse performance on all app measures was associated with greater disease severity on CDR plus NACC-FTLD (r range, 0.38-0.59) (Table 1, Figure 3, and eFigure 4 in Supplement 1). The same pattern of results was observed after excluding those with finger dexterity issues. Except for go/no-go, performance of participants with prodromal FTLD was statistically significantly worse than that of participants without symptoms on all measures (P < .001).

Figure 1. Screenshots of Smartphone Cognitive Tests and Testing Schedule

Figure 2. Reliability of Smartphone Cognitive Tests in a Mixed Sample of Adults Without Functional Impairment and Participants With Frontotemporal Lobar Degeneration

Figure 3. Association of Smartphone Cognitive Tests With Criterion Standards and Detection of Deficits in Early Frontotemporal Lobar Degeneration (FTLD)

JAMA Network Open | Neurology
All study procedures were approved by the UCSF or Johns Hopkins Central Institutional Review Board. All participants or legally authorized representatives provided written informed consent. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. Inclusion criteria were age 18 years or older, having access to a smartphone, and reporting English as the primary language. Race and ethnicity were self-reported by participants using options consistent with the National Alzheimer's Coordinating Center (NACC) Uniform Data Set (UDS) and were collected to contextualize the generalizability of these results. Participants were asked to complete tests on their own smartphones. Informants were encouraged for all participants and required for those with symptomatic FTLD (Clinical Dementia Rating Scale plus NACC FTLD module [CDR plus NACC-FTLD] global score ≥1). Recruitment targeted individuals with CDR plus NACC-FTLD global scores less than 2, but sites had discretion to enroll more severely impaired participants. Exclusion criteria were consistent with the parent ALLFTD study. 12
JAMA Network Open. 2024;7(4):e244266. doi:10.1001/jamanetworkopen.2024.4266 (Reprinted) April 1, 2024
(including J.C.T., A.B.W., S.D., and M.M.) assisted participants with app download, setup, and orientation and observed participants completing the first questionnaire. All cognitive tasks were self-administered without supervision (except pilot participants, discussed below) in a predefined order with minor adjustments throughout the study. Study partners of participants with symptomatic FTLD were asked to remain nearby during participation to help navigate the ALLFTD-mApp but were asked not to assist with testing.

Syndromic diagnoses were based on multidisciplinary conferences that considered neurological history, neurological examination results, and collateral interview. 20
Validity
Validity analyses used participants' first exposure to each test. Linear regressions were fitted in participants without symptoms with age, sex, and educational level as independent variables to understand the unique contribution of each demographic factor to cognitive test scores. Correlations and linear regression between the app-based tasks and disease severity (CDR plus NACC-FTLD sum of boxes score), neuropsychological test scores, and gray matter ROIs were used to investigate construct validity in the full sample. Demographic characteristics were not entered as covariates because the primary goal was to assess associations between app-based measures and criterion standards, rather than understand the incremental predictive value of app measures. To address potential motor confounds, associations with disease severity were evaluated in a subsample without finger dexterity deficits on motor examination (using the Movement Disorder Society Uniform Parkinson Disease Rating Scale Finger Tapping subscale). To complement ROI-based neuroimaging analysis based on a priori hypotheses, we conducted voxel-based morphometry (eMethods in Supplement 1) to uncover other potential neural correlates of test performance. 31,32 Finally, we evaluated the association of the number of distractions and operating system with reliability and validity, controlling for age and disease severity, which are predictive factors associated with test performance in correlation analyses.
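The construct validity analyses rest on correlations between app-based scores and criterion standards. As a minimal illustration, the sketch below computes a Pearson correlation coefficient with the standard library. The choice of Pearson (rather than a rank-based coefficient) is an assumption, and the function name and example values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical app composite scores and in-person executive composite scores.
app_scores = [0.2, -0.5, 1.1, 0.4, -1.3, 0.9]
criterion_scores = [0.1, -0.2, 0.8, 0.6, -1.0, 0.5]
print(round(pearson_r(app_scores, criterion_scores), 2))
```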

Table 1. Participant Characteristics and Test Scores a

Psychometrics
Descriptive statistics for each task are presented in Table 2. Ceiling effects were not observed for any tests. A small percentage of participants were at the floor for flanker (19 [5.3%]), go/no-go (13 [4.0%]), and card sort (9 [3.3%]) scores. Floor effects were only observed in participants with prodromal or symptomatic FTLD.
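Floor and ceiling effects of this kind can be summarized as the percentage of participants scoring at the lowest or highest possible value. The sketch below shows one way to compute that percentage; the function name, the definition of the bound as an exact attainable score, and the example data are assumptions for illustration.

```python
def percent_at_bound(scores, bound):
    """Percentage of scores equal to a floor or ceiling value."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return 100.0 * sum(1 for s in scores if s == bound) / len(scores)


# Hypothetical task with a floor score of 0: 19 of 360 participants at floor.
scores = [0] * 19 + [1] * 341
print(round(percent_at_bound(scores, 0), 1))  # 5.3
```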

Table 1. Participant Characteristics and Test Scores a (continued)
Full sample includes all participants, including 31 who did not have an available CDR plus NACC FTLD rating. Asymptomatic includes adults without functional impairment and carriers of preclinical pathogenic variants. For continuous variables, the number of participants with available data is given. For binary variables, the number of participants in the category is given.