Introduction The Wood–Ljungdahl pathway (WLP) of acetogens is speculated to be the first biochemical pathway on Earth that emerged when the atmosphere was still highly reduced and rich in CO, CO2, and H2 ( ; ; ). These C1 gases can be converted into acetyl-CoA through the WLP ( ; ) and acetogens are the only known organisms using the WLP as a terminal electron-accepting, energy-conserving process to fix CO2 into biomass ( ; ). This pathway is responsible for the production of acetic acid in quantities surpassing the billion ton mark annually. It is estimated that the pathway contributes to fixing ∼20% of the CO2 on Earth ( ; ). All this takes place with the WLP operating at the edge of thermodynamic feasibility ( ) and requires the use of the third mode of energy conservation, electron bifurcation, which likely contributed to the emergence of life on Earth ( ; ; ). Acetogens are also attractive cell factories for the sustainable production of fuels and chemicals from gaseous waste feedstocks (e.g., syngas from gasified municipal solid waste and industrial waste gases) ( ; ; ; ). While the field has advanced enormously in the last decade ( ; ), a better fundamental understanding of acetogen metabolism is needed to guide rational metabolic engineering, for example, to increase substrate uptake or product yields.

 Recent quantitative studies of acetogen physiology have expanded understanding of their metabolism considerably (reviewed in ; ). Although most biochemical details of the WLP are well established ( , , ) and systems-level understanding of acetogen metabolism has recently improved ( , ), key transcriptional features such as promoter motifs and transcriptional regulators controlling the expression of genes needed for autotrophic growth are yet unknown. This information could benefit acetogen metabolic engineering and improve our understanding of their complex transcriptional regulation ( ; ; ; ). Prediction of promoter motifs strictly based on computational analysis (based solely on the organism’s genome sequence) has the drawback of detection of promoter-like sequences across the genome, which is particularly pronounced in non-conserved DNA motifs ( ). An instrumental step toward more accurate promoter motif identification was the development of the differential RNA-sequencing (dRNA-Seq) technology, first described in 2010 by Sharma and colleagues ( ) for the human pathogen Helicobacter pylori . 

dRNA-Seq enables the experimental determination of transcription start sites (TSSs). Correct mapping of TSSs in turn enables genome-wide identification of promoters and regulatory sequences, and provides experimental data for a more accurate genome annotation. Once a TSS has been experimentally determined, promoter sequences can be mapped from there. Thus, characterization of the transcriptional architecture (i.e., TSSs and promoter motifs) and a more accurate annotation of acetogen genomes have the potential to yield valuable insights into the complex transcriptional regulation of acetogens. To date, only one study has determined TSSs in acetogens, using Eubacterium limosum ( ). Here, we used dRNA-Seq as a tool to identify the TSSs in the model acetogen Clostridium autoethanogenum grown under autotrophic and heterotrophic conditions. The subsequent search for promoter motifs detected a previously undescribed motif associated with essential genes in acetogens. We then provide experimental evidence for the relevance of this new promoter motif (termed hereafter P cauto ) by identifying a TetR-family protein that activates gene expression from this motif by directly binding to the RNA polymerase.



Materials and Methods Bacterial Strains and Growth Conditions Clostridium autoethanogenum strain DSM 10061 was obtained from The German Collection of Microorganisms and Cell Cultures (DSMZ). Cells were grown as described before ( ) for acquiring samples for differential RNA-sequencing (dRNA-Seq). Briefly, heterotrophic and autotrophic growth were investigated in serum bottles on fructose (5 g/L) and on steel mill off-gas (35% CO, 10% CO2, 2% H2, and 53% N2), respectively. Cells were grown at 37°C on a shaker (100 RPM, revolutions per minute) and sampled for dRNA-Seq analysis from the exponential growth phase (OD 600 nm = 0.5–0.6).



Differential RNA-Sequencing (dRNA-Seq) Extraction and preparation of RNA for cDNA library construction were performed as described elsewhere ( ). Briefly, RNA was extracted using TRIzol followed by column purification with RNeasy (Qiagen). The resulting total RNA pools were sent to Vertis Biotechnologie AG (Freising, Germany) for sequencing. The cDNA libraries were prepared using the 5′tagRACE method ( ). Firstly, the 5′ Illumina TruSeq sequencing adapter carrying sequence tag TCGACA was ligated to the 5′-monophosphate groups (5′P) of processed transcripts (TAP- on ). Samples were then treated with Tobacco Acid Pyrophosphatase (TAP) to convert 5′-triphosphate (5′PPP) structures of primary transcripts into 5′P ends, to which the 5′ Illumina TruSeq sequencing adapter carrying sequence tag GATCGA was ligated (TAP+ on ). Next, first-strand cDNA was synthesized using an N6 randomized primer, and the 3′ Illumina TruSeq sequencing adapter was ligated to it after fragmentation.
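Because the two libraries are distinguished only by the 6-nt sequence tag on the 5′ adapter, reads can in principle be assigned to the processed (TAP-) or primary (TAP+) pool by their tag. The following is a minimal sketch of that idea, not the pipeline used in this work. It assumes that the tag forms the first six nucleotides of each read; the function names and the example reads are hypothetical.

```python
# Sequence tags as given in the text.
# Assumption (not stated in the source): the tag is the first 6 nt of each read.
TAGS = {"TCGACA": "processed (TAP-)", "GATCGA": "primary (TAP+)"}
TAG_LEN = 6


def split_read(read):
    """Return (library, insert); library is None when no known tag matches."""
    tag, insert = read[:TAG_LEN].upper(), read[TAG_LEN:]
    return TAGS.get(tag), insert


def count_libraries(reads):
    """Count reads per library, pooling unmatched reads as 'untagged'."""
    counts = {}
    for read in reads:
        library, _ = split_read(read)
        key = library or "untagged"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical reads for illustration only.
example_reads = ["TCGACAATGGCA", "GATCGATTTAGC", "GATCGACCGTAA", "AAAAAAGGGCCC"]
library_counts = count_libraries(example_reads)
```

Keeping both pools in a single PCR and separating them afterwards by tag is what allows the relative abundance of processed and primary 5′ ends to be compared directly.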

 The 5′ cDNA fragments were amplified with PCR using a proofreading enzyme and primers designed for TruSeq sequencing according to the manufacturer’s instructions. The main advantage of using the 5′tagRACE method ( ) for dRNA-Seq comes from amplifying the 5′ ends of processed and primary transcripts in a single PCR reaction, which preserves their quantitative representation in an RNA pool. Finally, 5′ cDNAs were purified using the Agencourt AMPure XP Kit (Beckman Coulter Genomics) and analyzed by capillary electrophoresis before sequencing the single-end libraries using the Illumina NextSeq 500 system and a MID 150 Kit with 75 bp read length. 



Determination of Transcription Start Sites (TSSs) Sequencing reads were aligned and mapped to the genome of C. autoethanogenum DSM 10061 (CP006763.1) using the software TopHat2 ( ) without trimming or removal of any reads. Reads were processed with the TSSAR (TSS Annotation Regime) software ( ) for automated de novo determination of TSSs from dRNA-Seq data using the following parameters: p-value 1e-3, Noise threshold 10, Merge range 5. The identified TSSs were classified as primary (within 250 nt upstream of an annotated gene), internal (within an annotated gene), antisense (on the opposite strand of an annotated gene), or orphan (not assigned to any of the previous classes) ( ). Since our main aim was the identification of the TSSs of essential genes for autotrophic growth in acetogens (e.g., WLP), we focused on the primary TSSs.
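The four TSS classes can be illustrated with a short sketch. This is not the TSSAR implementation. It assumes 1-based inclusive gene coordinates, assigns a single class per TSS in the priority order primary, internal, antisense, orphan, and treats a TSS located exactly at the gene start as primary; these choices and all names are illustrative assumptions.

```python
from collections import namedtuple

# 1-based, inclusive coordinates with start < end (assumed convention).
Gene = namedtuple("Gene", "name start end strand")


def upstream_distance(pos, gene):
    """Distance in nt from a TSS to the gene start, measured upstream on the
    gene's strand. Negative values mean the TSS lies downstream of the start."""
    return gene.start - pos if gene.strand == "+" else pos - gene.end


def classify_tss(pos, strand, genes, max_upstream=250):
    """Assign one class per TSS; the priority order is an assumption."""
    same = [g for g in genes if g.strand == strand]
    other = [g for g in genes if g.strand != strand]
    if any(0 <= upstream_distance(pos, g) <= max_upstream for g in same):
        return "primary"
    if any(g.start <= pos <= g.end for g in same):
        return "internal"
    if any(g.start <= pos <= g.end for g in other):
        return "antisense"
    return "orphan"


# Hypothetical annotation for illustration only.
example_genes = [Gene("geneA", 1000, 2000, "+"), Gene("geneB", 3000, 4000, "-")]
```

For a gene on the minus strand the upstream region lies at higher genome coordinates, which is why the distance is computed from the gene's end coordinate in that case.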



Search for Promoter Motifs and the Shine-Dalgarno Sequence To determine promoter motifs, we searched for consensus sequence motifs 50, 100, and 150 nt upstream of primary TSSs using the MEME software ( ) with the following parameters: -dna, -maxsize 10000000, -mod zoops, -nmotifs 50, -minw 4, -maxw 50, -revcomp, -oc. Only motifs with E-value ≤ 0.05 and at least 13 associated TSSs (i.e., at least two associated genes, ) were considered and ranked based on the number of assigned TSSs ( ).
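The filtering and ranking criteria can be expressed compactly. The sketch below assumes that each motif has already been parsed into a record holding its E-value and the number of assigned TSSs; the record layout, function name, and example values are hypothetical and do not correspond to MEME's output format.

```python
def select_motifs(motifs, max_evalue=0.05, min_tss=13):
    """Keep motifs with E-value <= max_evalue and at least min_tss assigned
    TSSs, ranked by the number of assigned TSSs (largest first)."""
    kept = [m for m in motifs
            if m["evalue"] <= max_evalue and m["n_tss"] >= min_tss]
    return sorted(kept, key=lambda m: m["n_tss"], reverse=True)


# Hypothetical motif records for illustration only.
example_motifs = [
    {"id": "motif_a", "evalue": 1e-20, "n_tss": 120},
    {"id": "motif_b", "evalue": 0.2, "n_tss": 300},   # fails the E-value cutoff
    {"id": "motif_c", "evalue": 1e-3, "n_tss": 12},   # too few TSSs
    {"id": "motif_d", "evalue": 0.05, "n_tss": 400},
]
ranked_ids = [m["id"] for m in select_motifs(example_motifs)]
```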

To search for the Shine-Dalgarno sequence, 30 nt upstream of annotated genes (CP006763.1 and NC_022592.1) were searched with the MEME software ( ) using the same parameters as in the promoter motif search, except for -nmotifs 10, -maxw 30.
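Both searches operate on fixed-length windows upstream of a reference position (a primary TSS or an annotated gene start). A minimal, strand-aware sketch of extracting such windows is shown below. It assumes 1-based positions on a linear sequence and truncates windows at the sequence ends; the function names and the toy sequence are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (unambiguous bases only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(genome, pos, strand, length):
    """Return up to `length` nt immediately upstream of 1-based position `pos`,
    read 5'->3' on the given strand. Circularity is ignored in this sketch."""
    if strand == "+":
        start = max(0, pos - 1 - length)
        return genome[start:pos - 1]
    end = min(len(genome), pos + length)
    return reverse_complement(genome[pos:end])


# Toy sequence for illustration only.
toy_genome = "ACGTTGCAAT"
plus_window = upstream_window(toy_genome, 5, "+", 3)
minus_window = upstream_window(toy_genome, 5, "-", 3)
```

With `length` set to 50, 100, or 150 the same function yields the promoter search windows, and with 30 it yields the Shine-Dalgarno search windows.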



 Search for the New Promoter Motif in Acetogens Occurrence of the new promoter motif (see results) in C. autoethanogenum , C. ljungdahlii , C. ragsdalei , C. coskatii , M. thermoacetica , and E. limosum was determined using the FIMO tool ( ) within the MEME software by searching for the sequence up to 300 nt upstream of annotated genes (since no TSS data is available) with default FIMO parameters. Occurrence in each acetogen relative to C. autoethanogenum was normalized with the number of annotated genes. 
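One way to read this normalization is as motif hits per annotated gene, expressed relative to the same ratio in C. autoethanogenum. The sketch below follows that interpretation, which is an assumption; the function name and the numbers are hypothetical and are not results from this work.

```python
def relative_occurrence(hits, n_genes, ref_hits, ref_genes):
    """Motif hits per annotated gene, relative to the reference organism."""
    return (hits / n_genes) / (ref_hits / ref_genes)


# Hypothetical counts for illustration only (not data from this study).
same_as_reference = relative_occurrence(200, 4000, 200, 4000)
half_of_reference = relative_occurrence(100, 4000, 200, 4000)
```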



DNA-Binding Protein Assay Firstly, C. autoethanogenum (DSM 19630) cells were acquired from autotrophic bioreactor chemostat cultures (CO or CO + H2) described in a separate work ( ). Briefly, cells were grown in bioreactor chemostat cultures in the chemically defined medium on either CO or CO + H2 at 37°C, pH = 5, a dilution rate of ∼1 day⁻¹ (μ ∼0.04 h⁻¹), and a biomass concentration of ∼1.4 gDCW/L. Cells were pelleted by immediate centrifugation (20,000 × g for 2 min at 4°C) and stored at −80°C until analysis.
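The two rates quoted for the chemostat are related by a unit conversion, since the specific growth rate equals the dilution rate in a chemostat at steady state. A small sketch of the arithmetic is given below; the doubling time is added only for orientation and is not a value reported in this work.

```python
import math


def per_day_to_per_hour(rate_per_day):
    """Convert a rate from day^-1 to h^-1."""
    return rate_per_day / 24.0


# Steady-state chemostat: specific growth rate (mu) equals dilution rate (D).
mu_per_hour = per_day_to_per_hour(1.0)
doubling_time_h = math.log(2) / mu_per_hour
```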

Frozen pellets were thawed, resuspended in BS/THES buffer described in with pH adjusted to 7.0, and passed five times through the EmulsiFlex-C5 High Pressure Homogenizer (Avestin Inc.) according to the manufacturer’s instructions, with the final sample volume adjusted to 35 mL with the BS/THES buffer. Samples were then centrifuged (35,000 × g for 15 min at 4°C) and the supernatant filtered using a 0.22 μm filter (Merck).

 The DNA-binding protein assay was based on a pull-down/DNA affinity chromatography method described by with the following modifications. The DNA sequences were of 125 bp length containing the respective promoter sequence in the middle with flanking regions downstream and upstream. pH of the buffers was adjusted to 7. The bait-target/ligand binding step was performed with 1 mL of cell extract without the addition of non-specific competitor DNA. 

Next, either salmon sperm DNA (Thermo) or Poly dI-dC (Sigma) was used as non-specific competitor DNA in the subsequent washing steps. Briefly, Dynabeads M-280 Streptavidin (Thermo Fisher Scientific) were mixed with DNA containing the promoter sequence of either CAETHG_1615, 1617 (WLP genes assigned with the new promoter motif), or 3224 (a glycolytic gene as a control for our assay since it was assigned the well-known TATAAT motif, which should yield binding of the RNAP and the housekeeping σ factor σ ). Next, the cell extract was added and samples were incubated for 30 min at room temperature. This was followed by two washing steps with the BS/THES buffer ( ) to remove proteins not bound to the target DNA. Finally, protein elution was performed in Tris-HCl (pH 7) with a successively increasing concentration of NaCl (200, 300, 500, and 750 mM, 1 M, and 2 M). The eluted protein solutions were analyzed by gel electrophoresis using NuPAGE Novex Bis-Tris (Invitrogen) and visualized using Sypro Ruby (Molecular Probes) according to the manufacturer’s instructions. The 500 mM NaCl eluate yielded the most prominent bands and was therefore used for further analysis. No bands were observed in the negative control when water was used instead of DNA (data not shown), confirming that the identified proteins were pulled down by the DNA sequences.



 Protein Digestion for Mass Spectrometry-Based Proteomics For the digestion of proteins from gel band excision, the gel bands of interest were cut and de-stained for 1 h with a buffer of 50 mM ammonium bicarbonate (ABC) in 50% acetonitrile (ACN). Following buffer removal, 50 μL of 10 mM DTT was added and samples were incubated for 30 min at 60°C to reduce disulfide bonds. Next, the DTT solution was removed, and 50 μL of 55 mM iodoacetamide (IAA) was added and samples were incubated for 30 min in the dark at room temperature to alkylate sulfhydryl groups. After removal of the IAA solution, gel pieces were washed twice with 100 μL of 50 mM ABC, and dehydrated with 100% ACN. Protein digestion was performed overnight at 37°C by rehydrating gel pieces with 50 μL of Trypsin/Lys-C mix (10 ng/μL in 25 mM ABC) and 100 μL of ABC. 

Extraction of peptides from gel pieces was performed by repeating the following steps five times: addition of 100 μL of 0.1% formic acid (FA) in 50% ACN and sonication of samples in a water bath for 10 min. Samples were then concentrated to near dryness using a centrifugal vacuum concentrator (Eppendorf) and resuspended in 50 μL of 0.1% FA. Samples were then desalted using C18 ZipTips (Merck Millipore) as follows: the column was wetted using 0.1% FA in 100% ACN, equilibrated with 0.1% FA in 70% ACN, and washed with 0.1% FA before loading the sample and washing again with 0.1% FA. Finally, peptides were eluted with 0.1% FA in 70% ACN, and then diluted 10-fold with 0.1% FA for mass spectrometry analysis.

 For the digestion of proteins from the whole purified DNA bound material, the whole purified DNA-bound material from the DNA-protein binding assay was incubated for 30 min at 95°C. Next, 30 μL of 10 mM DTT was added and samples were incubated for 45 min at 55°C to reduce disulfide bonds. Then, 40 μL of 55 mM IAA was added and samples were incubated for 30 min in the dark at room temperature to alkylate sulfhydryl groups. Protein digestion was performed overnight at 37°C using 50 μL of Trypsin/Lys-C mix (10 ng/μL in 25 mM ABC) and stopped by lowering the pH to 3 using FA. Finally, the samples were desalted and prepared for mass spectrometry analysis as described above. 



Protein Identification Using Mass Spectrometry Detection of proteins in both the digestion products of gel band excision and the whole captured material was performed using a QTOF Sciex 5600 or a Thermo Orbitrap Elite mass spectrometer (depending on instrument availability) with details described elsewhere ( ; ) with a modified liquid chromatographic (LC) gradient. Briefly, a Shimadzu Prominence nanoLC system was used for desalting (on an Agilent C18 trap) and separating peptides (on a Vydac Everest C18 column) using a gradient consisting of 10–60% buffer B over 30 min followed by 60–97% buffer B over 8 min, where buffer A was 1% ACN in 0.1% FA and buffer B was 80% ACN in 0.1% FA. Protein identification was performed using the software ProteinPilot v5.0 (ABSciex) with the Paragon Algorithm against the NC_022592.1 and CP006763 genome annotations with the following search parameters: Trypsin + LysC digestion; IAA as cysteine alkylation; thorough search effort; FDR analysis. Only proteins below 1% false discovery rate (FDR; estimated global) and with at least two peptides with more than 95% confidence were considered as identified.
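The acceptance criteria for an identified protein amount to a two-part filter. The sketch below states them explicitly. The record layout is a hypothetical simplification and does not mirror the ProteinPilot export format; the accession labels are placeholders.

```python
def is_identified(protein, max_fdr=0.01, min_peptides=2, min_conf=95.0):
    """True if the protein is below the global FDR cutoff and has at least
    `min_peptides` peptides with confidence above `min_conf` percent."""
    confident = [c for c in protein["peptide_confidences"] if c > min_conf]
    return protein["global_fdr"] < max_fdr and len(confident) >= min_peptides


# Hypothetical records for illustration only.
example_proteins = [
    {"id": "protein_1", "global_fdr": 0.002, "peptide_confidences": [99.0, 97.5, 80.0]},
    {"id": "protein_2", "global_fdr": 0.002, "peptide_confidences": [99.0, 95.0]},
    {"id": "protein_3", "global_fdr": 0.050, "peptide_confidences": [99.0, 99.0]},
]
identified_ids = [p["id"] for p in example_proteins if is_identified(p)]
```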

 Proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD014421. 



Molecular Biology Techniques The full list of bacterial strains, plasmids, and primers used in this work for the in vivo transcription assay and protein overexpression step is shown in . Luria-Bertani (LB) broth or agar with antibiotics was used for growth.

Escherichia coli DH5α was used as the cloning strain and transformations were performed according to the manufacturer’s instructions (BIOLINE). E. coli BL21 was used in the in vivo transcription assay and protein overexpression step. E. coli BL21 chemically competent cells were prepared using the RuCl 2 method ( ).

PCR amplification of targeted sequences was performed using the Phusion polymerase (NEB) and the OneTaq polymerase (NEB). Plasmids were assembled using standard ligation with the T4 DNA ligase or using Gibson assembly ( ).

 Construction of a σ-Factor Candidate Expression System in E. coli Candidates for potential σ factors were selected based on protein identification using mass spectrometry (see above) from proteins annotated as transcriptional regulators ( ). Additionally, we also built a plasmid for the L -seryl-tRNA(Sec) selenium transferase (CAETHG_2839) (identified as a stronger band in the pull-down assay, ), and the housekeeping σ in Clostridia (σ ) (CAETHG_2917) ( ). 

 The potential σ factor candidates were cloned into plasmid pET28a+ to be expressed under the control of a T7 promoter. DNA sequences were PCR amplified using the primers shown in and purified using a QIAGEN kit. Next, the plasmid pET28a+ was linearized using restriction enzymes Nde I and Hin dIII and purified using a QIAGEN kit. Codon optimization was required to express the σ factor candidates of TetR-family protein (CAETHG_0459) and σ (CAETHG_2917) before DNA sequences were synthesized as gene block (gBlock ) fragments. 

Plasmids with the σ factor candidates were then assembled by Gibson assembly using equimolar concentrations of the linearized backbone plasmid and the PCR fragment in a 20 μL reaction. After incubation at 50°C, 5 μL of the Gibson mix was used to transform E. coli DH5α by heat shock. After recovery in SOC medium at 37°C for 60 min, 100 μL of cells were spread on LB agar plates containing kanamycin (50 μg/mL). Plates were then incubated at 37°C for 16 h and kanamycin-resistant colonies were tested by colony PCR for proper assembly using pET_conf(FWD)/pET_conf(REV) primers ( ). A colony that tested positive for assembly was then picked and grown overnight in LB medium containing kanamycin. Plasmids were recovered from 5 mL of overnight culture using a QIAGEN miniprep kit and the restriction digestion profile was checked against the expected assembly. Plasmids were then used to transform E. coli BL21 chemically competent cells (described above).

 Escherichia coli BL21 strains harboring σ factor candidate-expressing plasmids were then grown overnight on LB media containing kanamycin and 1 mM IPTG (Isopropyl β- D -1-thiogalactopyranoside). Next, 2 mL of overnight culture were spun down and the supernatant was removed. The cell pellet was then resuspended in the BugBuster master mix solution (Novagen) for protein extraction following the manufacturer’s instructions. The insoluble and soluble fractions were loaded into an SDS-PAGE gel to confirm the overexpression of the σ factor candidates (data not shown). 



 Construction of a P cauto _GFP-UV Reporter Fusion System in E. coli To determine whether the σ factor candidates could activate transcription, we assembled a GFP-based reporter expression system under the control of the new promoter motif (termed here P cauto ). Firstly, plasmid pBR322 was digested with Hin dIII, purified, and used as the backbone followed by PCR amplification of the DNA sequence containing P cauto from the C. autoethanogenum genome (500 bp upstream of the start codon of the gene CAETHG_1617) and purification using a QIAGEN kit. 

 Next, the GFP-UV gene was PCR amplified from plasmid pBR_PprpR-GFPUV and purified after which the three DNA fragments were added at an equimolar concentration to a Gibson assembly mix subsequently incubated at 50°C. Five μL of the Gibson mix was used to transform chemically competent E. coli DH5α cells by heat shock and after incubation at 37°C, 100 μL of cells were spread on LB agar plates containing ampicillin (100 μg/mL) and incubated at 37°C for 16 h. Ampicillin resistant colonies were then tested by colony PCR using the primer sets of P cauto -GFP_conf(FWD-1)/P cauto -GFP_conf(REV-1) and P cauto -GFP_conf(FWD-2)/P cauto -GFP_conf(REV-2) ( ). Confirmed colonies were picked and grown overnight on LB containing antibiotic for plasmid recovery. The digestion profile confirmed the assembly of plasmid pBR_P cauto _GFP. 

 The P cauto -GFP-UV was excised from pBR_P cauto _GFP using restriction enzyme Hin dIII. Digestion mix was loaded on a 1% agarose gel and the P cauto -GFP-UV region recovered using a QIAGEN gel extraction kit. Then, the recovered DNA sequence was cloned into plasmid pACYC184, which was previously digested with Hin dIII and purified using a QIAGEN kit. 

 Ligation was performed according to the manufacturer’s instruction and 5 μL of the mix was used to transform E. coli DH5α competent cells. After heat shock and incubation, 100 μL of cells were spread on LB agar containing chloramphenicol (30 μg/mL) and incubated at 37°C for 16 h. Chloramphenicol-resistant colonies were tested by colony PCR for proper assembly. Positive colonies were then grown overnight on LB media and the plasmid was recovered. Assembly of plasmid pACYC_P cauto _GFP was confirmed by digestion profile and Sanger sequencing (AGRF, Australia) (data not shown). 



Construction of Variants for the P cauto Promoter Motif Region Later, a new reporter system including the P cauto and the GFPuv sequences was built to remove the 500 bp upstream region in pBR_P cauto _gfp. The idea was to keep only the sequence used for the pull-down assay plus the ribosomal binding site (Shine-Dalgarno sequence) to be tested in vivo with TetR-family protein (CAETHG_0459) and σ (CAETHG_2917) (see next section), the two proteins that responded positively in the in vivo assay (see section Results). This new plasmid, pBR_P cauto 130_gfp, was built by cloning the PCR product of primers WLP130F and WLP130R, using pBR_P cauto _gfp as template, at the Hin dIII site of pBR322 by Gibson assembly ( ). Then, the P cauto 130_gfp region was excised from pBR_P cauto 130_gfp using Hin dIII and Cla I and cloned by ligation in pACYC184 to build plasmid pAC_P cauto 130_gfp. A variation of the promoter region (pAC_P cauto 30C_gfp) was also built to introduce single nucleotide changes in the WLP promoter motif. Changes were as follows: ctggagcaggttttgtagttgcagtaactggttcaata , changed to ccatcaaaggtcttaaagttgcagtaactggttcaata . This promoter was again tested with the TetR-family protein (CAETHG_0459) and σ (CAETHG_2917). Plasmid maps are available upon request.
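The positions changed in the promoter variant can be read off by comparing the two sequences given above base by base. The following sketch does this for equal-length sequences; the function name is illustrative and positions are reported 1-based from the 5′ end of the printed sequence.

```python
WILD_TYPE = "ctggagcaggttttgtagttgcagtaactggttcaata"
VARIANT = "ccatcaaaggtcttaaagttgcagtaactggttcaata"


def substitutions(reference, variant):
    """List (position, reference_base, variant_base) for every mismatch
    between two equal-length sequences; positions are 1-based."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [(i, r, v)
            for i, (r, v) in enumerate(zip(reference, variant), start=1)
            if r != v]


changes = substitutions(WILD_TYPE, VARIANT)
changed_positions = [pos for pos, _, _ in changes]
```

For the two 38-nt sequences as printed here, the comparison reports nine substitutions, all within the first 16 nt; the remaining 22 nt are identical.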



 In vivo Transcription Activation of P cauto -GFP(UV) Fusion by the Candidate Genes in E. coli Escherichia coli BL21 was used for the in vivo assay. Firstly, six biological replicate cultures of cells were grown in a 96-well plate (Corning Costar catalog number #3799) carrying the pACYC plasmid with or without (to correct for the autofluorescence of the cells) the promoter-GFPuV fusion reporter in trans with a pET plasmid carrying each of the σ factor candidates. Additionally, a system with cells carrying either the pACYC promoter-GFPuV fusion reporter or its backbone plasmid plus the pET plasmid with no candidate was used as the control. 

Cells were grown in 150 μL of LB media containing kanamycin and chloramphenicol at 30°C and agitation of 200 RPM. At mid-exponential phase, cells were sub-cultured to a black 96-well plate (Greiner #655090) to an initial OD of 0.05–0.1 in LB media containing kanamycin and chloramphenicol supplemented with either 0.0 mM IPTG (No IPTG) or 1.0 mM IPTG. The in vivo experiment was performed at 30°C and agitation of 200 RPM.

Growth was followed by measuring the optical density (OD) at 600 nm while fluorescence intensity (FI; for GFP expression) was measured using an excitation filter of 355 nm and an emission filter of 520 nm. The experiment was conducted using the FLUOstar Omega microplate reader and the Omega software v.1.20 (BMG LabTech). Fluorescence intensity was normalized per OD (FI/OD) and the signal resulting from the cells harboring the backbone plasmid only was subtracted from that of the cells carrying the promoter-GFP fusion reporter (Normalized FI/OD).
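The normalization consists of two steps: dividing fluorescence by optical density, then subtracting the autofluorescence measured in cells carrying only the backbone plasmid. A minimal sketch follows; the function names and readings are hypothetical, and any blank correction of the raw OD or FI values is assumed to have been applied beforehand.

```python
def fi_per_od(fi, od):
    """Fluorescence intensity normalized per unit of optical density."""
    return fi / od


def normalized_fi_od(reporter_fi, reporter_od, backbone_fi, backbone_od):
    """FI/OD of reporter cells minus FI/OD of backbone-only cells."""
    return fi_per_od(reporter_fi, reporter_od) - fi_per_od(backbone_fi, backbone_od)


# Hypothetical plate-reader values for illustration only.
example_signal = normalized_fi_od(1200.0, 0.6, 300.0, 0.5)
```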

For the WLP promoter motif variants (described in the previous section), four biological replicates were used.

A two-tailed Student’s t-test was performed on the normalized FI/OD values of each candidate without versus with IPTG, and of each candidate versus the control system. A candidate gene was considered to activate gene expression from P cauto if it met both of the following conditions: (1) there was a statistically significant difference (p-value < 0.01) in FI/OD between the candidate without and with IPTG; (2) there was a statistically significant difference (p-value < 0.01) between the FI/OD signal of the candidate and the control vector (PET_) with IPTG.
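The decision rule combines two significance tests. The sketch below shows the pooled-variance (Student's) t statistic for two groups of replicates and the two-condition rule itself. Converting the t statistic to a two-tailed p-value requires the t distribution, which is left to a statistics package here; the function names and example values are hypothetical.

```python
import math
import statistics


def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


def activates_promoter(p_induction, p_vs_control, alpha=0.01):
    """Both comparisons must be significant at the chosen threshold."""
    return p_induction < alpha and p_vs_control < alpha


# Hypothetical replicate values for illustration only.
t_value, dof = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Requiring both comparisons guards against two different false positives: a candidate that responds to IPTG no more than the empty vector does, and a candidate whose signal is high regardless of induction.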



Overproduction and Purification of TetR-Family Protein (CAETHG_0459) To test whether the TetR-family protein CAETHG_0459 activates transcription from P cauto by interacting directly with the RNAP, the target protein had to be heterologously expressed and purified for the protein-protein interaction (PPI) assay.

The E. coli strain harboring the plasmid pET_TetR1 (CAETHG_0459) was grown at 30°C and 200 RPM until mid-exponential phase in LB media containing kanamycin. Cells were sub-cultured to 1 L LB media containing kanamycin to an initial OD of 0.05–0.1 and subsequently grown until OD ∼1 at 30°C and 200 RPM. Then, 1.0 mM IPTG was added and cells were left growing until OD ∼3. Cells were pelleted from the 1 L culture by centrifugation at 5,000 × g for 20 min at 4°C, the pellet was resuspended in 5 mL of the BugBuster Master Mix (Merck Millipore #71456) per gram of wet cell weight with EDTA-free protease inhibitor cocktail (Sigma #11836170001), and then incubated in a rotating mixer for 20 min at room temperature. Next, cell debris was removed by centrifugation at 16,000 × g for 20 min at 4°C and the supernatant (supplemented with 20 mM imidazole) was loaded on a 1 mL Ni-HisTrap HP column (GE Healthcare #71-5027-68 AK) and washed with a buffer containing 100 mM Tris-HCl (pH 7), 100 mM NaCl, and 20 mM imidazole.

The TetR-family protein CAETHG_0459 was eluted in the same wash buffer containing a stepwise imidazole gradient (50–500 mM), followed by a buffer exchange performed using a HiTrap Desalting column (GE Healthcare #17-1408-01). Finally, the purified protein was stored in 50 mM Na2HPO4, 300 mM NaCl, pH 7, 50% glycerol. Protein purity was analyzed by gel electrophoresis using NuPAGE Novex Bis-Tris (Invitrogen) and stained with SimplyBlue SafeStain (Novex). Protein concentration was measured by the Direct Detect Spectrometer (Merck Millipore).



TetR-Family Protein (CAETHG_0459)-RNA Polymerase Core Enzyme Interaction Experiment The PPI experiment was performed as described previously ( ) with some modifications. The purified TetR-family protein (CAETHG_0459) with 6-His-tag (2 μg) was coupled to Ni-NTA agarose beads (Thermo #88831) in 800 μL of buffer A (50 mM Na2HPO4, 300 mM NaCl, 50 mM imidazole, pH 7). The beads coupled with the target protein were then washed three times in buffer B (50 mM Na2HPO4, 300 mM NaCl, 0.1% Tween 20, 50 mM imidazole, pH 7). Next, the beads-protein complex was incubated with E. coli RNA polymerase Core enzyme (2.5 μg) (BioLabs #M0550S) at 37°C for 2 h. After two washes in buffer A, the beads-protein complex was suspended in 15 μL of Laemmli Buffer [32.9 mM Tris-HCl, pH 6.8, 13.15% (w/v) glycerol, 1.05% SDS, 0.005% bromophenol blue, 355 mM 2-mercaptoethanol], heated at 100°C for 5 min, and analyzed by gel electrophoresis using NuPAGE Novex Bis-Tris (Invitrogen) and stained with SimplyBlue SafeStain (Novex). The negative control was performed by incubating the RNA polymerase Core enzyme with Ni-NTA agarose beads following the same procedure.



Electrophoretic Mobility Shift Assay (EMSA) The EMSA experiment was performed with the purified protein extract of the TetR-family protein (CAETHG_0459) and the 130 bp oligonucleotide containing the new promoter motif used for the pull-down assay, on both agarose and polyacrylamide gels. The binding reaction contained 2 μL of 5x binding buffer (50 mM Tris-HCl pH 8.0, 720 mM KCl, 2.6 mM EDTA, 0.5% Triton-X-100, 62.5% glycerol, 1 mM DTT), 5 μL of the extracted protein, 3 μL of a 20 μM oligonucleotide, and 0.5 μL of 100x BSA. The binding reaction was incubated for 1 h at room temperature. For the reaction on the polyacrylamide gel including the E. coli RNA polymerase Core enzyme (BioLabs #M0550S), the reaction was further incubated for 30 min after adding 3 μL (3 units) of RNAP. Gel electrophoresis was performed in either 1% agarose or 7.5% polyacrylamide gels. Controls with only the extracted protein or DNA were also loaded. For agarose, the entire sample was loaded and run at 90 V for 120 min. For polyacrylamide, the entire sample was loaded and run at 150 V for 50 min. The gels were stained with SYBR Safe and visualized using BIO-RAD ChemiDoc.



Visualization of Cells Harboring the P cauto -GFP(UV) Fusion and the TetR-Family Protein (CAETHG_0459) Plasmids by Microscopy Cells carrying the P cauto -GFP(UV) fusion reporter and the TetR-family protein (CAETHG_0459) plasmids were analyzed by microscopy to visualize the expression of GFP. For this, cells were plated on an LB agar plate (LB media containing 6 g/L of agar) containing 1.0 mM IPTG, kanamycin, and chloramphenicol. After overnight incubation at 37°C, colonies were visualized using the ZOE Fluorescent Cell Imager (Bio-Rad) according to the manufacturer’s instructions with the following parameters: Gain: 40; Exposure (ms): 340; LED intensity: 22; Contrast: 59.







 Results Differential RNA-Sequencing (dRNA-Seq) In this work, we aimed to determine the TSSs of essential genes for autotrophic growth of the model-acetogen C. autoethanogenum (e.g., genes in the WLP and of hydrogenases). We thus performed dRNA-Seq analysis ( ) of autotrophic (CO, CO 2 , and H 2 ; referred to as ‘syngas’) and heterotrophic (fructose) cultures of C. autoethanogenum to experimentally determine TSSs and promoter motif(s) associated with essential genes for autotrophic growth in acetogens. 

 Previously described batch cultures ( ) were sampled during exponential growth and subjected to dRNA-Seq cDNA library preparation and sequencing. The cDNA libraries were prepared using the 5′tagRACE method ( ), an improved library preparation method compared to TEX (5′-phosphate-dependent Terminator RNA exonuclease) that has the advantage of preserving the quantitative representation of 5′ ends between processed (5′-P end) and primary (5′-PPP end) transcripts (see Materials and Methods). TSSs were determined by comparing the libraries enriched for processed (TAP-) and primary (TAP+) transcripts ( ) using the TSSAR tool ( ). 



Overall dRNA-Seq Features of C. autoethanogenum We classified TSSs as primary, internal, antisense, and orphan ( and ) and found primary TSSs only for around half of the annotated genes (3,983) in C. autoethanogenum ( ) ( ). More than 60% of the genes contain only one primary TSS, while the rest show up to 12 TSSs ( and ). Focusing on the 14 main metabolic groups of C. autoethanogenum genes as described in , we detected primary TSSs for all genes except for the Nfn transhydrogenase complex (CAETHG_1580) ( ). While primary TSSs were detected for seven of the 11 genes of the WLP biosynthetic gene cluster (CAETHG_1606-21), only half of the WLP TSSs were shared between syngas and fructose. For example, genes of the WLP methyl branch (CAETHG_1614-17) contained 20 primary TSSs on syngas compared to only nine on fructose. On the other hand, the TSSs associated with Hydrogenases and ATPase genes were found in similar numbers between syngas and fructose.

 Determination of nucleotide base preferences for transcription initiation within five nucleotides downstream and upstream of the primary TSSs showed a clear enrichment of adenine (A) and guanine (G) at +1 (∼90%) and thymine (T) at −1 for both syngas ( ) and fructose (data not shown). Overall, adenine and cytosine were the most and least preferred nucleotide bases, respectively. 
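Base preferences of this kind are obtained by tabulating nucleotide frequencies at each position relative to the TSS. A minimal sketch is shown below. It assumes that each TSS is represented by a 10-nt window covering positions −5 to −1 and +1 to +5 (with +1 being the TSS itself and no position 0); the function names and the example windows are hypothetical and are not data from this study.

```python
from collections import Counter


def position_labels(flank=5):
    """Positions -flank..-1 and +1..+flank; there is no position 0."""
    return [p for p in range(-flank, flank + 1) if p != 0]


def position_frequencies(windows, flank=5):
    """Per-position base frequencies for windows of length 2 * flank."""
    freqs = {}
    for i, label in enumerate(position_labels(flank)):
        counts = Counter(w[i].upper() for w in windows)
        total = sum(counts.values())
        freqs[label] = {base: counts.get(base, 0) / total for base in "ACGT"}
    return freqs


# Hypothetical windows for illustration only.
example_windows = ["ACGTTATGCA", "GGCATGACCT", "TTAGTACGTA", "CCGATATTGC"]
example_freqs = position_frequencies(example_windows)
```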

 Analysis of 5′untranslated regions (5′UTRs)—the sequence between the TSS and the annotated start codon—indicates transcripts potentially associated with post-transcriptional regulation and thus of mRNA stability and translational efficiency ( ). Calculation of 5′UTR lengths for primary TSSs showed a median length of 63 nt with 65% of TSSs < 100 nt for both growth conditions ( and ). Genes with longer UTR lengths tend to be regulated more at the post-transcriptional level ( ; ). On the other hand, leaderless mRNAs—mRNAs with no or <10 nt 5′UTR—are translated in the absence of upstream signals (typically the Shine-Dalgarno sequence) ( ; ) used for regulating translational efficiency through ribosome binding. We found ∼70 (∼2%) leaderless mRNAs with <10 nt 5′UTRs, none of which were in the WLP, Hydrogenases, Acetate or Ethanol groups ( and ). 
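The 5′UTR statistics follow directly from the TSS and start codon coordinates. The sketch below assumes 1-based coordinates, with the start codon position given as the first nucleotide of the codon on the gene's own strand. The thresholds of 10 nt and 100 nt are those used in the text; the function names and example lengths are hypothetical.

```python
import statistics


def utr_length(tss, start_codon_pos, strand):
    """5'UTR length in nt: number of bases from the TSS up to, but not
    including, the first nucleotide of the start codon."""
    return start_codon_pos - tss if strand == "+" else tss - start_codon_pos


def summarize_utrs(lengths, leaderless_below=10, short_below=100):
    """Median length, leaderless count, and fraction of UTRs below short_below."""
    return {
        "median": statistics.median(lengths),
        "leaderless": sum(1 for n in lengths if n < leaderless_below),
        "fraction_short": sum(1 for n in lengths if n < short_below) / len(lengths),
    }


# Hypothetical UTR lengths for illustration only.
example_summary = summarize_utrs([0, 5, 63, 80, 150])
```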

In addition to the ability to determine TSSs, dRNA-Seq analysis also facilitates a more accurate annotation of the genome. Based on the TSSs and the position of the Shine-Dalgarno sequence (AGGAGG), which was found to be highly conserved within 9–14 nt upstream of the first start codon (ATG/CTG/GTG/TTG) ( ), we re-annotated the start codon for 38 genes and confirmed the changes in one gene by peptide identification using mass spectrometry ( ). Moreover, either the start or stop codon of an additional 99 genes, which had previously been annotated in different frames, was manually corrected. The corrections have been deposited into NCBI under the accession number BK010482 and the complete manually corrected GenBank file of C. autoethanogenum is available in .



 Discovery of a New Promoter Motif The RNA polymerase (RNAP) needs to form a holoenzyme with a σ factor in bacteria to recognize a specific promoter motif (sequence) and initiate transcription ( ; ). Experimentally determined TSS data from dRNA-Seq analysis is ideal for in silico determination of promoter motifs, which is important for understanding transcriptional regulation, especially in less-studied bacteria such as acetogens. 

We searched for consensus sequence motifs 50 nt upstream of primary TSSs using the MEME software ( ) and were able to determine seven promoter motifs in C. autoethanogenum ( E -value ≤ 0.05) ( , for syngas and fructose growth, respectively). Of those identified, only three motifs were assigned with more than 100 TSSs and shared between the two datasets, likely representing the most conserved motifs in C. autoethanogenum ( ).
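Preparing the input for such a motif search amounts to cutting out the 50 nt upstream of each TSS in transcript orientation and writing them as FASTA. The sketch below shows one way to do this; the function names, the 0-based coordinates, and the toy sequence are assumptions for illustration, and the resulting FASTA text is what a motif finder such as MEME would then be given.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, pos, strand, length=50):
    """Return `length` nt immediately upstream of the TSS, in transcript orientation.

    `pos` is the 0-based forward-strand index of the +1 base (assumed convention).
    Returns None if the region would extend past the sequence ends.
    """
    if strand == "+":
        start = pos - length
        if start < 0:
            return None
        return genome[start:pos]
    end = pos + 1 + length
    if end > len(genome):
        return None
    return revcomp(genome[pos + 1:end])


def upstream_fasta(genome, tss_list, length=50):
    """Build FASTA text from (name, pos, strand) TSS records."""
    lines = []
    for name, pos, strand in tss_list:
        seq = upstream_region(genome, pos, strand, length)
        if seq is None:
            continue  # skip TSSs too close to a sequence end
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy example with a 5 nt window instead of 50 nt.
toy_genome = "AAAAACCCCCGGGGGTTTTT"
toy_fasta = upstream_fasta(toy_genome, [("tss_plus", 10, "+"), ("tss_minus", 9, "-")], length=5)
```

Both toy TSSs yield the upstream sequence CCCCC, the minus-strand one after reverse complementation.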

The top motif was found 10 nt upstream of primary TSSs (447 and 543 TSSs for syngas and fructose, respectively; E -value < 10 ) and resembles the Pribnow box (TATGnTATAAT), which is associated with the housekeeping σ factors of Escherichia coli (σ ; ), Helicobacter pylori (σ ; ) and Clostridium acetobutylicum (σ ; , ). As expected, the well-known −35 TTGACA and −10 TATAAT motifs (TATA box in eukaryotes and archaea) for housekeeping σ factors ( ) were also among the top-3 promoter consensus sequences (392 and 262 TSSs for syngas and fructose, respectively; E -value < 10 ). These two motifs were assigned for most of the genes of glycolysis/gluconeogenesis and the TCA cycle ( ).

The third most abundant promoter motif has, to the best of our knowledge, not previously been reported in the literature ( ). The new promoter motif (termed here P cauto ) is highly conserved during growth on both syngas (Motif 02 in ; 392 TSSs; E -value < 10 ) and fructose (Motif 03 in ; 224 TSSs; E -value < 10 ). Importantly, P cauto seems to be involved in the transcriptional regulation of essential genes for acetogens and was assigned to genes of the WLP cluster (CAETHG_1606-21) and the metabolic groups, as described in , of Hydrogenases, Acetate, ATPase, and Pyruvate ( and , , ). We confirmed the unique presence of the “new promoter motif” upstream of the TSSs. Investigation of its upstream regions up to 100 or 150 nt showed no other motif apart from the one conserved within 50 nt upstream of TSSs. This new promoter is well characterized by an evenly interspaced (A/T)G repetition with an almost central A/T position ( ). These observations potentially indicate the presence of a new σ factor or transcriptional regulator of critical importance in acetogens.



 RNA Polymerase and Proteins Annotated as Transcriptional Regulators Specifically Bind P cauto We performed DNA-protein binding assays to determine if the RNAP and/or other protein(s) bind to P cauto . The promoter sequences of two WLP genes (CAETHG_1615 and 1617, Methylene-tetrahydrofolate reductase domain-containing protein and Methenyl-tetrahydrofolate cyclohydrolase, respectively) annotated with P cauto were used for the DNA-protein binding assay using the promoter pull down/DNA affinity chromatography method ( ; ). The promoter sequence of a glycolytic gene (CAETHG_3424, glyceraldehyde-3-phosphate dehydrogenase, type I) was included as a control for the assay since it was assigned the well-known TATAAT motif, which should yield binding of the RNAP and the housekeeping σ factor, σ . DNA-bound proteins captured using streptavidin-coupled magnetic Dynabeads were identified using mass spectrometry of the digestion products of the whole captured material and of gel band excisions. Since this DNA-protein binding assay requires significant amounts of cellular protein material, especially for efforts to identify low abundance proteins such as σ factors or transcriptional regulators, autotrophic bioreactor chemostat cultures (CO or CO + H 2 ) of C. autoethanogenum described in a separate work ( ) were sampled for this analysis. 

The promoter pull down/DNA affinity chromatography method ( ; ) was fine-tuned for C. autoethanogenum . Eluting the proteins with 500 mM NaCl yielded the most prominent bands while no bands were observed in the negative control when water was used instead of DNA (data not shown), which confirms that the identified proteins were pulled down by the DNA sequences (see section Materials and Methods). The alpha and beta subunits of the RNAP (CAETHG_1920 and 1954-55) were successfully identified for both P cauto (CAETHG_1615 and CAETHG_1617) and the TATAAT motif control ( ). Additionally, the RNAP omega subunit was identified in the whole purified DNA-bound material for both motifs ( ). The housekeeping σ (CAETHG_2917) was detected for the TATAAT motif control as expected ( ). A stronger band was observed in the P cauto gels around 50 kDa and identified as a protein annotated as L -seryl-tRNA(Sec) selenium transferase (CAETHG_2839; 51.5 kDa) ( ). Finally, mass spectrometry analysis of the whole purified DNA-bound material identified three proteins annotated as transcriptional regulators (based on NC_022592.1) that were unique to P cauto ( ) and found for both CO and CO + H 2 cultures across technical replicates of the DNA-protein binding assay ( ). These proteins were likely not visible on the DNA-protein binding assay gels as both their respective mRNA and protein abundances in C. autoethanogenum are very low ( , ).



TetR-Family Transcriptional Regulator (CAETHG_0459) Activates Transcription From P cauto in vivo To determine whether any of the three identified protein candidates annotated as transcriptional regulators that uniquely bind to P cauto ( ) could activate transcription from this promoter, we created a transcriptional fusion reporter vector harboring the sequence of P cauto in-frame with a green fluorescent protein (GFPuV). We also tested transcriptional activation using the L -seryl-tRNA(Sec) selenium transferase (CAETHG_2839) (identified as a stronger band in the pull-down assay, ), and using the housekeeping σ factor in clostridia (σ ) (CAETHG_2917), since it has been reported that promoter binding sites of different σ factors can overlap ( ). Transcriptional activation of P cauto with concomitant GFP production was investigated in E. coli by inducing the expression of the candidate activator proteins from a second T7 protein over-expression vector cloned into plasmid pET28e+ by the addition of IPTG (see section Materials and Methods). Fluorescence was measured at early exponential growth (OD ∼0.26) as FI/OD.

After subtracting the signal from cells harboring the two plasmids but lacking the fusion reporter (promoter + GFP, see section Materials and Methods), only induction of the TetR-family transcriptional regulator protein (CAETHG_0459) (out of the three transcriptional regulator candidates) led to significantly higher levels of GFP expression ( p < 0.01) compared to the control vector with no candidate ( and ). Interestingly, induction of σ also led to transcription activation ( p < 0.01). We then confirmed expression of GFP in the strain expressing CAETHG_0459 grown on a plate with IPTG using fluorescence microscopy ( ). This shows that both CAETHG_0459 and σ independently activate transcription from P cauto . Importantly, the motif is associated with the expression of essential genes in gas-fermenting acetogens including genes in the WLP and hydrogenases ( , , ) that show higher transcript expression during growth on gas compared to sugar ( ).
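The FI/OD normalization and the subtraction of the reporter-free signal described above reduce to simple arithmetic, sketched below. All readings are hypothetical numbers chosen for illustration, the function names are invented, and the sketch deliberately stops short of the significance testing reported in the study.

```python
from statistics import mean


def fi_per_od(fluorescence, od):
    """Fluorescence intensity normalized to optical density (FI/OD)."""
    return fluorescence / od


def background_corrected(samples, background):
    """Subtract the mean FI/OD of reporter-free cells from each sample's FI/OD.

    `samples` and `background` are lists of (fluorescence, od) readings.
    """
    baseline = mean(fi_per_od(fi, od) for fi, od in background)
    return [fi_per_od(fi, od) - baseline for fi, od in samples]


# Hypothetical replicate readings (arbitrary fluorescence units, OD).
reporter_free = [(250.0, 0.25), (250.0, 0.25)]   # two plasmids, no promoter + GFP fusion
with_candidate = [(500.0, 0.25), (750.0, 0.25)]  # reporter plus induced candidate protein
no_candidate = [(300.0, 0.25), (300.0, 0.25)]    # reporter plus empty expression vector

signal_candidate = background_corrected(with_candidate, reporter_free)
signal_control = background_corrected(no_candidate, reporter_free)
fold_change = mean(signal_candidate) / mean(signal_control)
```

Comparing the corrected signals of the candidate and the no-candidate control across replicates is then the basis for a statistical test of activation.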

The 130 bp variant (which includes the sequence used for the pull-down assay plus the ribosomal binding site) also showed a statistically significant increase in fluorescence ( p -value < 0.01) when the TetR-family transcriptional regulator protein (CAETHG_0459) was present. Similarly, σ could also activate transcription, however, only at the level of p -value < 0.05. When mutations were included in the promoter motif, TetR (CAETHG_0459) could no longer activate expression of GFP, as expected ( ). Additionally, electrophoretic mobility shift assay (EMSA) experiments confirmed binding of the TetR-family protein (CAETHG_0459) together with the RNAP to the 130 bp promoter sequence used for the pull-down assay ( ). We tried to test the effect of TetR (CAETHG_0459) expression knock-down on the phenotype of C. autoethanogenum but were unsuccessful in obtaining reproducible results.



 CAETHG_0459 Directly Binds to the RNA Polymerase Core Enzyme As TetR-family proteins often act as transcriptional regulators ( ), we next investigated whether TetR-family protein CAETHG_0459 activates transcription from P cauto by interacting directly with the RNAP. Transcriptional regulators can reversibly interact with the RNAP Core enzyme independently of a DNA sequence to help activate transcription from a range of promoters ( ; ). We thus performed an in vitro PPI assay to test whether protein CAETHG_0459 directly interacts with RNA polymerase Core in the absence of DNA. The purified His-tagged CAETHG_0459 protein linked to Ni -beads was incubated with the RNAP Core enzyme (see section Materials and Methods). SDS-PAGE analysis clearly demonstrated an interaction between the core RNA polymerase and CAETHG_0459 ( lane 6) and shows that CAETHG_0459 acts as a positive transcriptional regulator that activates transcription from P cauto by directly binding to the RNAP. 



 P cauto Is Represented in Other Acetogens We next investigated if P cauto was represented in other industrially relevant acetogens with available genomes: Clostridium ljungdahlii , C. ragsdalei , C. coskatii , Moorella thermoacetica , and Eubacterium limosum ( ; ; ; ). We performed the reverse of the methodology previously used to search for consensus sequence motifs by looking for the occurrence of P cauto 300 nt upstream of annotated genes (since no TSS data was available) using the FIMO tool ( ) within MEME. As expected based on their phylogenetic proximity ( ; ; ), C. ljungdahlii , C. ragsdalei , and C. coskatii showed similar occurrences of P cauto ( ). Interestingly, while the representation in M. thermoacetica was very low, P cauto seems to be present also in E. limosum . This result highlights the need for experimental determination of TSSs in more acetogens. 
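The idea of scanning upstream regions for occurrences of a known motif can be sketched with a position weight matrix and a log-odds score. This is a simplified stand-in and not FIMO itself: FIMO reports statistical significance for each match, whereas the sketch uses a raw score threshold, scans one strand only, and relies on a made-up four-column matrix rather than the actual P cauto matrix. All names and parameters are assumptions.

```python
import math


def log_odds_matrix(pfm, background=0.25, pseudocount=0.01):
    """Convert per-position base probabilities into log2 odds against a uniform background."""
    matrix = []
    for column in pfm:
        scores = {}
        for base in "ACGT":
            p = (column.get(base, 0.0) + pseudocount) / (1 + 4 * pseudocount)
            scores[base] = math.log2(p / background)
        matrix.append(scores)
    return matrix


def scan(seq, matrix, threshold):
    """Return (start, score) for every window scoring at or above `threshold`."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip ambiguous bases such as N
        score = sum(matrix[k][base] for k, base in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Hypothetical 4-column motif: (A/T) G (A/T) G.
toy_pfm = [
    {"A": 0.5, "T": 0.5},
    {"G": 1.0},
    {"A": 0.5, "T": 0.5},
    {"G": 1.0},
]
toy_matrix = log_odds_matrix(toy_pfm)
toy_hits = scan("CCAGTGCC", toy_matrix, threshold=4.0)
```

In practice the sequences passed to such a scan would be the 300 nt upstream of each annotated gene, extracted in transcript orientation.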





 Discussion Acetogens offer an enormous potential for the production of fuels and chemicals from gaseous waste feedstocks ( ; ; ; ), with ethanol already being produced at industrial scale by LanzaTech. Acetogens have two major carbon fixation pathways: the WLP for autotrophic growth and glycolysis for heterotrophic growth. Although both the WLP and glycolysis/gluconeogenesis pathways operate during autotrophic and heterotrophic growth, the WLP carries a substantially higher metabolic flux during autotrophy ( , ) and vice versa ( ). Transcriptomic studies have shown that transcriptional regulation between autotrophic and heterotrophic growth in acetogens is complex and includes many non-obvious expression changes ( ; ; ; ). We thus aimed to determine TSSs and transcriptional features of promoter motifs and transcriptional regulators associated with essential genes (including genes of the WLP) in the model-acetogen C. autoethanogenum . 

Our study revealed a new promoter motif and identified two proteins that activate gene expression from it [the TetR-family protein (CAETHG_0459) and the housekeeping σ (CAETHG_2917)]. An alternative TetR transcriptional regulator has been previously found to be a σ factor in Clostridium tetani , and its homologs, TcdR in C. difficile , BotR in C. botulinum , and UviA in C. perfringens have also been found to regulate toxin production ( ; ; ). In combination, these results suggest that TetR proteins can play an important role in transcriptional regulation in clostridia. These studies are consistent with our PPI assay, which potentially suggests that the TetR-family protein might function as a σ factor in C. autoethanogenum , but further studies ( in vitro transcription assay) are needed to confirm this. In fact, unequivocal demonstration of σ factor activity requires that a protein is necessary and sufficient for activation of promoter recognition and transcription initiation by RNAP, independent of any other σ factor subunit. Thus our results do not exclude the possibility that a native σ factor of the in vivo expression host ( E. coli ), e.g., σ , could have induced the TetR-family protein to drive transcription from P cauto . Additional studies should also be performed to study whether both the σ and the TetR-family protein show an overlap in the promoter motif for transcriptional activation ( ).

 Notably, there are several TetR-family proteins, commonly regarded as transcriptional regulators ( ), annotated in the C. autoethanogenum genome. In pathogenic clostridia these TetR-family proteins are often described as alternative σ factors, belonging to a class of σ factors called extracytoplasmic function (ECF) σ factors ( ; ). Their discovery led to a novel class of σ factors (group 5), which show a −35 and/or −10 conserved region in their target promoters ( , ; ; ). It will be interesting to see whether transcription from P cauto described here with an interspaced repetition of (A/T)G notably distinct from the canonical −35/−10 conserved regions is also activated by a novel σ factor. 

Our work also shows that the housekeeping σ factor (σ ) in clostridia can activate transcription from P cauto associated with essential genes for autotrophic growth in acetogens. Interestingly, in another acetogen, E. limosum , the promoter regions of genes of the WLP, hydrogenases, and ATPase contain the well-known –35 TTGACA and –10 TATAAT motifs for the housekeeping σ factor (σ ) ( ; ). This potentially indicates that the housekeeping σ in acetogens can initiate transcription from different promoter motifs and illustrates well the great extent of genetic diversity among the non-taxonomic group of acetogens. While the WLP itself is highly conserved, it is not surprising that transcriptional regulation is diverse ( ; ). The work presented here also highlights the importance of P cauto in other industrially relevant acetogens ( ). We believe, however, that more studies are needed for the experimental determination of TSSs and transcriptional features to facilitate a broader understanding of transcriptional regulation in acetogens.

Our findings have the potential to significantly advance the understanding of transcriptional regulation and metabolic engineering of the ancient metabolism of acetogens. Firstly, acetogen metabolism, which operates at the thermodynamic edge of feasibility ( ), seems to be wired for utilizing less energy-consuming mechanisms (i.e., transcriptional vs. translational regulation) for operating under different conditions, as evidenced by the complexity of the condition-specific transcriptional architecture ( ). More importantly, the discovery of P cauto and a key positive transcription factor (TetR-family protein) in acetogens can lead to the mechanistic description of transcriptional regulation of arguably the first biochemical pathway on Earth ( ; ; ).

In addition to expanding the fundamental understanding of a model acetogen, knowledge of the features controlling the expression of essential genes in acetogens could also contribute to the improvement of commercial gas fermentation for the sustainable production of fuels and chemicals. Increasing or modulating the activity of the described TetR transcription factor (either through over-expression and/or protein engineering or by deleting transcriptional repressor genes) could enhance the uptake of C 1 substrates through the WLP and thus improve growth and/or product formation (possibly by introducing P cauto in front of key genes). It could also be used as an orthologous system in other organisms, as, for instance, the TcdR system has been used in other Clostridium species ( ; ). Importantly, the newly discovered promoter P cauto could be harnessed to couple expression of heterologous pathways to mimic those of key central metabolism enzymes, potentially alleviating the common problem of imbalanced flux throughput between heterologous and native metabolic pathways.



Data Availability Statement dRNA-Seq data have been deposited in the NCBI Gene Expression Omnibus depository under accession number GSE108700. Re-annotation of C. autoethanogenum genome was deposited in the NCBI GenBank Third Party Annotation database under accession number BK010482. Proteomics data have been deposited to the ProteomeXchange Consortium ( http://proteomecentral.proteomexchange.org ) via the PRIDE partner repository with the dataset identifier PXD014421.



 Author Contributions RS, KV, MK, RT, LN, and EM designed the study and experiments. RS, RG, RP, KV, and CB performed the experiments. RS, KV, RT, RP, MK, SS, LN, and EM analyzed and interpreted the data. RS, KV, RT, and EM wrote the manuscript. All authors reviewed the manuscript. RT, MK, and SS were involved in the experimental design, data analysis and interpretation, and writing of the manuscript. 



Conflict of Interest MK, RT, and SS are employed by LanzaTech. The authors declare that this study received funding from the Australian Research Council (ARC LP140100213) in collaboration with LanzaTech. The ARC was not involved in the study design, collection, analysis, interpretation of data, and the writing of this article or the decision to submit it for publication. LanzaTech was involved in the study design, collection, analysis, interpretation of data, and the writing of this article and the decision to submit it for publication. LanzaTech has interest in commercializing technology using C. autoethanogenum . The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.